
The Answer Question in Question Answering Systems

André Filipe Esteves Gonçalves

Dissertação para obtenção do Grau de Mestre em Engenharia Informática e de Computadores

Júri

Presidente: Prof. José Manuel Nunes Salvador Tribolet
Orientador: Prof.ª Maria Luísa Torres Ribeiro Marques da Silva Coheur
Vogal: Prof. Bruno Emanuel da Graça Martins

Novembro 2011

Acknowledgements

I am extremely thankful to my supervisor, Prof. Luísa Coheur, and also to my colleague Ana Cristina Mendes (who practically became my co-supervisor) for all their support throughout this journey. I could not have written this thesis without them, truly.

I dedicate this to everyone that has inspired me and motivated me on a daily basis over the past year (or years) – i.e., my close friends, my chosen family. It was worth it. Thank you, you are my heroes.

Resumo

No mundo atual, os Sistemas de Pergunta-Resposta tornaram-se uma resposta válida para o problema da explosão de informação da Web, uma vez que conseguem efetivamente compactar a informação que estamos à procura numa única resposta. Para esse efeito, um novo Sistema de Pergunta-Resposta foi criado: o JustAsk.

Apesar de o JustAsk já estar completamente operacional e a dar respostas corretas a um certo número de questões, permanecem ainda vários problemas a endereçar. Um deles é a identificação errónea de respostas ao nível da extração e do clustering, que pode levar a que uma resposta errada seja retornada pelo sistema.

Este trabalho lida com esta problemática e é composto pelas seguintes tarefas: a) criar corpora para melhor avaliar o sistema e o seu módulo de respostas; b) apresentar os resultados iniciais do JustAsk, com especial foco no seu módulo de respostas; c) efetuar um estudo do estado da arte em relações entre respostas e identificação de paráfrases, de modo a melhorar a extração das respostas do JustAsk; d) analisar erros e detectar padrões de erros no módulo de extração de respostas do sistema; e) apresentar e implementar uma solução para os problemas detectados, o que incluirá uma normalização/integração das respostas potencialmente corretas, juntamente com a implementação de novas técnicas de "clustering"; f) avaliar o sistema com as novas adições propostas.

Abstract

In today's world, Question Answering (QA) systems have become a viable answer to the problem of information explosion over the web, since they effectively compact the information we are looking for into a single answer. For that effect, a new QA system called JustAsk was designed.

While JustAsk is already fully operational and giving correct answers to certain questions, there are still several problems that need to be addressed. One of them is the erroneous identification of potential answers and clustering of answers, which can lead to a wrong answer being returned by the system.

This work deals with this problem and is composed of the following tasks: a) create test corpora to properly evaluate the system and its answers module; b) present the baseline results of JustAsk, with special attention to its answers module; c) survey the state of the art in relating answers and paraphrase identification, in order to improve the extraction of answers of JustAsk; d) analyze errors and detect error patterns over the answer extraction stage; e) present and implement a solution to the problems detected, which will include normalization/integration of the potential answers, as well as new clustering techniques; f) evaluate the system with the new features proposed.

Keywords

JustAsk, Question Answering, Relations between Answers, Paraphrase Identification, Clustering, Normalization, Corpus, Evaluation, Answer Extraction


Index

1 Introduction 12

2 Related Work 14
  2.1 Relating Answers 14
  2.2 Paraphrase Identification 15
    2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora 15
    2.2.2 Methodologies purely based on lexical similarity 16
    2.2.3 Canonicalization as complement to lexical features 18
    2.2.4 Use of context to extract similar phrases 18
    2.2.5 Use of dissimilarities and semantics to provide better similarity measures 19
  2.3 Overview 22

3 JustAsk architecture and work points 24
  3.1 JustAsk 24
    3.1.1 Question Interpretation 24
    3.1.2 Passage Retrieval 25
    3.1.3 Answer Extraction 25
  3.2 Experimental Setup 26
    3.2.1 Evaluation measures 26
    3.2.2 Multisix corpus 27
    3.2.3 TREC corpus 30
  3.3 Baseline 31
    3.3.1 Results with Multisix Corpus 31
    3.3.2 Results with TREC Corpus 33
  3.4 Error Analysis over the Answer Module 34

4 Relating Answers 38
  4.1 Improving the Answer Extraction module: Relations Module Implementation 38
  4.2 Additional Features 41
    4.2.1 Normalizers 41
    4.2.2 Proximity Measure 41
  4.3 Results 42
  4.4 Results Analysis 42
  4.5 Building an Evaluation Corpus 42

5 Clustering Answers 44
  5.1 Testing Different Similarity Metrics 44
    5.1.1 Clusterer Metrics 44
    5.1.2 Results 46
    5.1.3 Results Analysis 47
  5.2 New Distance Metric – 'JFKDistance' 48
    5.2.1 Description 48
    5.2.2 Results 50
    5.2.3 Results Analysis 50
  5.3 Possible new features – Combination of distance metrics to enhance overall results 50
  5.4 Posterior analysis 51

6 Conclusions and Future Work 53


List of Figures

1 JustAsk architecture, mirroring the architecture of a general QA system 24
2 How the software to create the 'answers corpus' works 28
3 Execution of the application to create answers corpora in the command line 29
4 Proposed approach to automate the detection of relations between answers 38
5 Relations' module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion 40
6 Execution of the application to create relations' corpora in the command line 43


List of Tables

1 JustAsk baseline best results for the corpus test of 200 questions 32
2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages 32
3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google 33
4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category 33
5 Baseline results for the three sets of corpus that were created using TREC questions and New York Times corpus 34
6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus 34
7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC 42
8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded 47
9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC 50
10 Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result 51


Acronyms

TREC – Text REtrieval Conference
QA – Question Answering
POS – Part of Speech


1 Introduction

QA systems have been with us since the second half of the 20th century. But it was with the World Wide Web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the shot – after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.

This led us to the development of JustAsk [39, 38], a QA system with the mission to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears more often, in more documents. Since we are working with the web as our corpus, we are dealing with the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.

In summary, the goals of this work are:

– to survey the state of the art on answers' relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– to improve the Answer Extraction module, by finding/implementing a proper methodology/tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– to test new clustering measures;

– to create corpora and software tools to evaluate our work.

This document is structured in the following way: Section 2 describes specific related work in the subject of relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the Clustering stage and the results achieved; and finally, in Section 6, we present the main conclusions and future work.


2 Related Work

As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then talk about the different techniques to detect paraphrases described in the literature.

2.1 Relating Answers

As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency count and on the number and type of relationships between pairs present in the relevant passages.

From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – if answers are consistent and mutually entailing, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter');

Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers);

Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher');

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.

The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.

2.2 Paraphrase Identification

According to WordReference.com1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which one entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.

Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus2, from now on referred to as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages we will present an overview of some of the approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.

2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora

We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed clustering highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:

1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether each pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.


– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.

The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N ≤ 4) – in other words, how many common sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:

NGsim(S_1, S_2) = \frac{1}{N} \sum_{n=1}^{N} \frac{\text{Count\_match}(N\text{-gram})}{\text{Count}(N\text{-gram})} \qquad (1)

in which Count_match(N-gram) counts how many N-grams overlap for a given sentence pair, and Count(N-gram) represents the maximum number of N-grams in the shorter sentence.
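To make the measure concrete, here is a minimal Python sketch of Equation 1; the whitespace tokenizer and the choice of N = 4 are assumptions made for illustration, not the setup of Barzilay and Lee.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(s1, s2, max_n=4):
    """Word Simple N-gram Overlap of Equation 1."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    total = 0.0
    for n in range(1, max_n + 1):
        g1, g2 = ngrams(t1, n), ngrams(t2, n)
        if not g1 or not g2:
            continue  # a sentence shorter than n words contributes nothing
        # overlapping n-grams / maximum number of n-grams in the shorter sentence
        total += len(g1 & g2) / min(len(g1), len(g2))
    return total / max_n

print(ngram_overlap("the cat sat on the mat", "the cat lay on the mat"))
```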

A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edit operations needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function that measures how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.

In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module at JustAsk currently possesses, so this is a good starting point to explore some other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

L(x, \lambda) = -\log_2(\lambda / x) \qquad (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are too similar). The complete formula is written below:

logSim(S_1, S_2) = \begin{cases} L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\ e^{-k \ast L(x, \lambda)} & \text{otherwise} \end{cases} \qquad (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric that depends only on x, and not on y, i.e., the number of words of the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.

L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus3 and extra negative pairs (of non-paraphrases), to compensate for the lack of negative pairs offered by those two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better when the MSR Paraphrase Corpus was present in the test corpus.

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
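For illustration only, a minimal sketch of the basic, two-parameter LogSim of Equations 2 and 3 follows; the link counting is simplified to shared word types, the tokenizer is a plain whitespace split, and k = 2.7 is the constant discussed above. It is not the implementation of Cordeiro et al.

```python
import math

def logsim(s1, s2, k=2.7):
    """Basic LogSim (Equations 2 and 3): lexical links vs. the longest sentence."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    links = len(set(t1) & set(t2))      # lambda: words the two sentences share
    x = max(len(t1), len(t2))           # length of the longest sentence
    if links == 0:
        return 0.0
    l = -math.log2(links / x)           # Equation 2
    return l if l < 1.0 else math.exp(-k * l)   # Equation 3

print(logsim("Mary had a very nice dinner for her birthday",
             "It was a successful dinner party for Mary on her birthday"))
```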


2.2.3 Canonicalization as complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions are already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only the easy 'paraphrases' were identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S_1, S_2) = \cos(S_1, S_2) - \text{mismatch\_penalty} \qquad (4)

with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming formula – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:

s(S_1, S_2) = \max \begin{cases} s(S_1, S_2 - 1) - \text{skip\_penalty} \\ s(S_1 - 1, S_2) - \text{skip\_penalty} \\ s(S_1 - 1, S_2 - 1) + sim(S_1, S_2) \\ s(S_1 - 1, S_2 - 2) + sim(S_1, S_2) + sim(S_1, S_2 - 1) \\ s(S_1 - 2, S_2 - 1) + sim(S_1, S_2) + sim(S_1 - 1, S_2) \\ s(S_1 - 2, S_2 - 2) + sim(S_1, S_2 - 1) + sim(S_1 - 1, S_2) \end{cases} \qquad (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
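Read case by case, the recurrence allows skips, one-to-one, one-to-two and two-to-two sentence alignments. Below is a rough Python sketch of Equation 5 over two lists of sentences, assuming a pluggable sim function and an illustrative skip penalty; it is not the code of Barzilay and Elhadad.

```python
def alignment_score(sents1, sents2, sim, skip_penalty=0.1):
    """Dynamic-programming alignment weight of Equation 5 (illustrative sketch)."""
    n, m = len(sents1), len(sents2)
    # s[i][j] = best score aligning the first i sentences of text 1 with the first j of text 2
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                s[i][j - 1] - skip_penalty,                           # skip a sentence of text 2
                s[i - 1][j] - skip_penalty,                           # skip a sentence of text 1
                s[i - 1][j - 1] + sim(sents1[i - 1], sents2[j - 1]),  # 1-1 alignment
            ]
            if j >= 2:  # 1-2 alignment
                candidates.append(s[i - 1][j - 2]
                                  + sim(sents1[i - 1], sents2[j - 1])
                                  + sim(sents1[i - 1], sents2[j - 2]))
            if i >= 2:  # 2-1 alignment
                candidates.append(s[i - 2][j - 1]
                                  + sim(sents1[i - 1], sents2[j - 1])
                                  + sim(sents1[i - 2], sents2[j - 1]))
            if i >= 2 and j >= 2:  # 2-2 alignment
                candidates.append(s[i - 2][j - 2]
                                  + sim(sents1[i - 1], sents2[j - 2])
                                  + sim(sents1[i - 2], sents2[j - 1]))
            s[i][j] = max(candidates)
    return s[n][m]
```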

Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods, such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexically based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier, to determine whether we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.

4 http://www.natcorp.ox.ac.uk

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T_1, T_2) = \frac{1}{2}\left(\frac{\sum_{w \in T_1} maxsim(w, T_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} maxsim(w, T_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)}\right) \qquad (6)

For each word w in the segment T1, the goal is to identify the word that matches it best in T2, given by maxsim(w, T2), weight it by the specificity via IDF, sum these weighted similarities, and normalize the sum by the total IDF of the words in the segment in question. The same process is applied to T2 and, in the end, the two directions (T1 to T2 and vice versa) are averaged. They offer us eight different similarity measures to calculate maxsim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22] and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics; a small sketch of the aggregation in Equation 6 follows the list.

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related to them through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' that concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.

– Lin – takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.
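As announced above, here is a minimal sketch of the aggregation in Equation 6, assuming that a word-to-word similarity function (any of the eight metrics) and an IDF table are provided; it is an illustration, not the implementation of [27].

```python
def text_similarity(t1, t2, maxsim, idf):
    """Equation 6: IDF-weighted, bidirectional word-to-text similarity.

    t1, t2  -- lists of word tokens
    maxsim  -- function(word, tokens) -> best word-to-word similarity in tokens
    idf     -- dict mapping word -> inverse document frequency
    """
    def directed(src, dst):
        num = sum(maxsim(w, dst) * idf.get(w, 1.0) for w in src)
        den = sum(idf.get(w, 1.0) for w in src)
        return num / den if den else 0.0

    return 0.5 * (directed(t1, t2) + directed(t2, t1))

# Example with a trivial exact-match word similarity (a real system would plug
# in PMI-IR, LSA or one of the WordNet-based metrics here).
exact = lambda w, toks: 1.0 if w in toks else 0.0
print(text_similarity("the cat sat".split(), "a cat sat down".split(), exact, {}))
```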

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide whether one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource.

– Cosine measure – using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning, and therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with each element equal to 1 if the corresponding word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, \vec{a} and \vec{b}, and the similarity is calculated using the following formula:

sim(\vec{a}, \vec{b}) = \frac{\vec{a} \, W \, \vec{b}}{|\vec{a}| \, |\vec{b}|} \qquad (7)


with W being a semantic similarity matrix containing information about the similarity level of each word pair – in other words, a number between 0 and 1 for each W_{ij}, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's, and five also with a better precision rating than those three approaches.
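A minimal numpy sketch of Equation 7 follows, assuming the word-pair similarity matrix W has already been filled with one of the WordNet metrics; it only illustrates the computation, not Fernando and Stevenson's code.

```python
import numpy as np

def matrix_similarity(a, b, W):
    """Equation 7: similarity of two binary word vectors a and b under matrix W."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ W @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vocabulary of 3 words; W[i][j] holds the word-to-word similarity.
W = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
print(matrix_similarity([1, 0, 1], [0, 1, 1], W))
```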

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S_1, S_2) = Sim(S_1, S_2) / Diss(S_1, S_2) \qquad (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and therefore the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and positions in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas which have been described, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here represents what we found to have the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the paraphrase identification formula.

For all of these techniques, we also saw different corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two plus negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter, we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module. We then detail what we plan to do to improve this module.

5 http://www.inesc-id.pt

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types: factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions, like 'What is a parrot?', use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get us closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
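To make the idea concrete, here is a small sketch of what such a date normalizer could look like; the accepted input formats and the canonical 'D M Y' string follow the example above, while everything else (function name, use of strptime) is an assumption for illustration.

```python
from datetime import datetime

# Hypothetical list of surface formats expected in extracted date answers.
DATE_FORMATS = ["%B %d, %Y", "%B %d %Y", "%d %B %Y", "%Y-%m-%d"]

def normalize_date(answer):
    """Convert a date answer into the canonical form 'D<day> M<month> Y<year>'."""
    for fmt in DATE_FORMATS:
        try:
            d = datetime.strptime(answer.strip(), fmt)
            return f"D{d.day} M{d.month} Y{d.year}"
        except ValueError:
            continue
    return answer  # leave non-matching answers untouched

print(normalize_date("March 21, 1961"))   # D21 M3 Y1961
print(normalize_date("21 March 1961"))    # D21 M3 Y1961
```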

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as

Overlap(X, Y) = 1 - \frac{|X \cap Y|}{\min(|X|, |Y|)} \qquad (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned in Section 2.2.2, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any two members of the clusters. Initially, every candidate answer has a cluster of its own and then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
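For illustration, a minimal sketch of the overlap distance (Equation 9) and of the threshold-based single-link clustering just described; the threshold value and data structures are assumptions, not JustAsk's actual implementation.

```python
def overlap_distance(x, y):
    """Equation 9: token overlap distance between two candidate answers."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(tx & ty) / min(len(tx), len(ty))

def single_link_clusters(answers, dist, threshold=0.33):
    """Merge clusters while the two closest ones are under the threshold."""
    clusters = [[a] for a in answers]   # one cluster per candidate answer
    while len(clusters) > 1:
        # single-link: distance between clusters = minimum pairwise distance
        d, i, j = min((dist(a, b), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))
                      for a in clusters[i] for b in clusters[j])
        if d >= threshold:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters

answers = ["John F. Kennedy", "John Kennedy", "Lee Harvey Oswald"]
print(single_link_clusters(answers, overlap_distance))
```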

3.2 Experimental Setup

Over this section, we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = \frac{\text{Correctly judged items}}{\text{Relevant (non-null) items retrieved}} \qquad (10)

Recall:

Recall = \frac{\text{Relevant (non-null) items retrieved}}{\text{Items in test corpus}} \qquad (11)

F-measure (F1):

F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (12)

Accuracy:

Accuracy = \frac{\text{Correctly judged items}}{\text{Items in test corpus}} \qquad (13)

Mean Reciprocal Rank (MRR) – used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by

MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i} \qquad (14)
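A small sketch of how these measures can be computed from a list of judged answers is given below; the representation of judgements (None for unanswered questions, a boolean for correctness, and ranked candidate lists for MRR) is an assumption made purely for illustration.

```python
def precision(judgements):
    """judgements: list with True (correct), False (wrong) or None (no answer)."""
    answered = [j for j in judgements if j is not None]
    return sum(answered) / len(answered) if answered else 0.0

def recall(judgements):
    answered = [j for j in judgements if j is not None]
    return len(answered) / len(judgements)

def accuracy(judgements):
    return sum(j is True for j in judgements) / len(judgements)

def f1(judgements):
    p, r = precision(judgements), recall(judgements)
    return 2 * p * r / (p + r) if p + r else 0.0

def mrr(ranked_answer_lists, references):
    """Mean Reciprocal Rank over ranked candidate lists and sets of correct answers."""
    total = 0.0
    for candidates, refs in zip(ranked_answer_lists, references):
        for rank, answer in enumerate(candidates, start=1):
            if answer in refs:
                total += 1.0 / rank
                break
    return total / len(ranked_answer_lists)

judgements = [True, False, None, True]   # 4 test questions
print(precision(judgements), recall(judgements), accuracy(judgements), f1(judgements))
```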

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input, and then we see if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages that we found over each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhausting in itself: the corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1. Question=What is the capital of Somalia
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the file containing the corpus, with a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one to find and write down correct answers, compared to the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, by the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line
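A rough sketch of the interactive loop just described is shown below; file formats, prompts and function names are assumptions for illustration, not the actual script.

```python
import json

def build_references(questions_file, corpus_file, output_file):
    """Interactively collect references (correct answers) for each question."""
    questions = [q.strip() for q in open(questions_file) if q.strip()]
    snippets = json.load(open(corpus_file))   # assumed layout: {question: [content, ...]}
    references = []                           # an 'array of arrays', one list per question

    for question in questions:
        refs = []
        for content in snippets.get(question, []):
            print(f"\nQ: {question}\nSnippet: {content}")
            answer = input("Reference (empty to skip): ").strip()
            # Any string included in the snippet content is accepted as a reference.
            if answer and answer in content:
                refs.append(answer)
        references.append(refs)
        # Status is written out after the final snippet of each question.
        with open(output_file, "w") as out:
            json.dump(references, out, indent=2)

# build_references("questions.txt", "snippets.json", "references.json")
```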


3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and, in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us question: 'How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?'

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
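A simple sketch of this filtering step could look as follows; the pronoun list is the one given above, while the tokenization and the surrounding code are illustrative assumptions.

```python
import re

PRONOUNS = {"he", "she", "it", "her", "him", "hers", "his"}

def has_unresolved_reference(question):
    """True if the question contains a third-party pronoun with no antecedent."""
    tokens = re.findall(r"[a-z']+", question.lower())
    return any(t in PRONOUNS for t in tokens)

questions = [
    "When did she die?",
    "How long is the Okavango River?",
    "Who was his father?",
]
standalone = [q for q in questions if not has_unresolved_reference(q)]
print(standalone)   # only the self-contained question remains
```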

But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it' in the question.

We knew that this corpus was grouped by topics, and therefore was composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and by noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced it to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, in several variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
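The paragraph-level strategy can be pictured with the following sketch, which splits an article into paragraphs and keeps only those containing at least one query keyword; it is a plain-Python simplification of what is done on top of the Lucene index, not the actual indexing code.

```python
def paragraph_passages(article_text, query_keywords):
    """Return only the paragraphs of an article that contain query keywords."""
    keywords = {k.lower() for k in query_keywords}
    passages = []
    for paragraph in article_text.split("\n\n"):   # assumed paragraph delimiter
        words = set(paragraph.lower().split())
        if words & keywords:                       # at least one keyword present
            passages.append(paragraph.strip())
    return passages

article = ("Baghdad is the capital of Iraq.\n\n"
           "The weather report for the region follows.")
print(paragraph_passages(article, ["capital", "Iraq"]))
```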

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the corpus test of 200 questions

In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages (8, 32 or 64) retrieved from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions for which a correct final answer is selected from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.

                    Google – 8 passages          Google – 32 passages
Extracted Answers   Questions  CAE      AS       Questions  CAE      AS
                    (#)        success  success  (#)        success  success
0                   21         0        –        19         0        –
1 to 10             73         62       53       16         11       11
11 to 20            32         29       22       13         9        7
21 to 40            11         10       8        61         57       39
41 to 60            1          1        1        30         25       17
61 to 100           0          –        –        18         14       11
> 100               0          –        –        5          4        3
All                 138        102      84       162        120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

                    Google – 64 passages
Extracted Answers   Questions (#)  CAE success  AS success
0                   17             0            0
1 to 10             18             9            9
11 to 20            2              2            2
21 to 40            16             12           8
41 to 60            42             38           25
61 to 100           38             32           18
> 100               35             30           17
All                 168            123          79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google

        Google – 8 passages    Google – 32 passages    Google – 64 passages
     Total    Q   CAE   AS        Q   CAE   AS             Q   CAE   AS
abb      4    3     1    1        3     1    1             4     1    1
des      9    6     4    4        6     4    4             6     4    4
ent     21   15     1    1       16     1    0            17     1    0
hum     64   46    42   34       55    48   36            56    48   31
loc     39   27    24   19       34    27   22            35    29   21
num     63   41    30   25       48    35   25            50    40   22
       200  138   102   84      162   120   88           168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.

Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages   20      30      50      100
Accuracy   21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, they also take more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module - mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).

1. Wrong candidates extracted - for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted, '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

- 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

- 'Namibia has built a water canal measuring about 300 km long... Siyabona Africa Travel (Pty) Ltd, "Popa Falls - Okavango River - Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search than a certain predefined threshold.

2. Parts of person names/proper nouns - several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs 'Elizabeth II'

'Mt Kilimanjaro' vs 'Mount Kilimanjaro'

'River Glomma' vs 'Glomma'

'Consejo Regulador de Utiel-Requena' vs 'Util-Requena' vs 'Requena'

'Calvin Broadus' vs 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr Clinton', normalizing abbreviations such as 'Mr', 'Mrs', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J F Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Could there be a better metric for these particular cases? LogSim distance takes into account the number of links, i.e. common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words in each answer. So we have LogSim approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59). With the newly proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case, we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.


3. Places/dates/numbers that differ in specificity - some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e. one location composes part of the name of the other), but instead of referring to the same entity these answers share a relation of inclusion. Examples:

'Mt Kenya' vs 'Mount Kenya National Park'

'Bavaria' vs 'Overbayern, Upper Bavaria'

'2009' vs 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the line for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues - the same entity may appear under completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. Even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.

5. Different measures/conversion issues - such as having answers in miles vs km vs feet, pounds vs kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues - many names/entities appear either misspelled or in different formats (capital letters vs lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs 'Khotashan'

'Grozny' vs 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms - seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to include acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

Once we have automated this process and extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than relations of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis - a module in JustAsk to detect relations between answers and, from then on, use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

Each answer now possesses a set of relations it shares with other answers - namely the type of relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

- Frequency - i.e. the number of times the exact same answer appears without any variation;

- Equivalence;

- Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not change at all for any of the corpora.

Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
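To make the architecture in Figure 5 more concrete, the sketch below shows one possible way of organising the module in Java. It is only an assumption about the implementation style - the class names (Candidate, Relation, BaselineScorer, RulesScorer) and boost values simply mirror the description in this section, not the actual JustAsk source code.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative representation of a candidate answer and the relations it holds.
    enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    class Relation {
        RelationType type;
        Candidate other;
    }

    class Candidate {
        String text;
        double score;
        List<Relation> relations = new ArrayList<>();
    }

    // Scorer: boosts the score of the answers sharing a particular relation.
    interface Scorer {
        void score(List<Candidate> candidates);
    }

    // Baseline Scorer: counts the frequency of the exact same answer.
    class BaselineScorer implements Scorer {
        public void score(List<Candidate> candidates) {
            for (Candidate c : candidates)
                for (Candidate other : candidates)
                    if (other != c && other.text.equalsIgnoreCase(c.text))
                        c.score += 1.0;
        }
    }

    // Rules Scorer: rewards equivalence (1.0) and inclusion (0.5), as in the first experiment.
    class RulesScorer implements Scorer {
        public void score(List<Candidate> candidates) {
            for (Candidate c : candidates)
                for (Relation r : c.relations)
                    c.score += (r.type == RelationType.EQUIVALENCE) ? 1.0 : 0.5;
        }
    }

In this reading, the relation detection step fills each candidate's relations list and the two scorers are then applied in sequence before clustering.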


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
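As a rough illustration of what such a converter might do, the sketch below normalizes kilometre expressions to miles; the class name, regular expression and the fact that only one unit is shown are assumptions made for this example, not the actual normalizer code.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class MeasureNormalizer {
        // Matches e.g. "300 km" or "12.5 kilometres" (illustrative pattern).
        private static final Pattern KM =
                Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*(km|kilometres?|kilometers?)");

        static String normalize(String answer) {
            Matcher m = KM.matcher(answer.toLowerCase());
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                double miles = Double.parseDouble(m.group(1)) * 0.621371; // km -> miles
                m.appendReplacement(sb, String.format("%.1f miles", miles));
            }
            m.appendTail(sb);
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(normalize("about 300 km long")); // about 186.4 miles long
        }
    }

The same pattern-and-convert strategy extends naturally to feet, pounds and Fahrenheit.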

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of the new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' - in this case, however, we can easily limit 'King' removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a useful reminder that we must not underestimate, under any circumstances, the richness of natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' - for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, allowing room for further extensions.
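A minimal sketch of the idea, assuming a simple hard-coded mapping (the real list of acronyms and categories handled is not reproduced here), could be:

    import java.util.Map;

    class AcronymNormalizer {
        // Illustrative entries only; a fuller mapping would be loaded from a resource file.
        private static final Map<String, String> ACRONYMS = Map.of(
                "US", "United States",
                "U.S.", "United States",
                "UN", "United Nations");

        static String normalize(String answer) {
            return ACRONYMS.getOrDefault(answer.trim(), answer);
        }
    }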

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
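One possible reading of this heuristic is sketched below. The 20-character window and the 5-point boost follow the text; the method signature and the assumption that offsets are character positions inside the passage are illustrative.

    import java.util.List;

    class ProximityScorer {
        static final int MAX_AVG_DISTANCE = 20;  // characters
        static final double BOOST = 5.0;         // points added to the candidate's score

        // answerOffset: position of the candidate answer in the passage;
        // keywordOffsets: positions of the query keywords found in the same passage.
        static double boost(int answerOffset, List<Integer> keywordOffsets, double score) {
            if (keywordOffsets.isEmpty()) return score;
            double sum = 0;
            for (int off : keywordOffsets) sum += Math.abs(off - answerOffset);
            double avg = sum / keywordOffsets.size();
            return (avg <= MAX_AVG_DISTANCE) ? score + BOOST : score;
        }
    }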


4.3 Results

                  200 questions                     1210 questions
Before features   P: 50.6%  R: 87.0%  A: 44.0%      P: 28.0%  R: 79.5%  A: 22.2%
After features    P: 51.1%  R: 87.0%  A: 44.5%      P: 28.2%  R: 79.8%  A: 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1 and 4.2, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, considering the effort and hope that were put into the relations module (Section 4.1) in particular.

The variation in results is minimal - only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear and, in those cases, a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

- 'E' for a relation of equivalence between the pair;

- 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

- 'A' for alternative (complementary) answers;

- 'C' for contradictory answers;

- 'D' for cases of doubt between relation types.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.

In the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library), 'uk.ac.shef.wit.simmetrics.similaritymetrics':

- Jaccard Index [11] - the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across both.

- Dice's Coefficient [6] - related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double the weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five elements: Ke, en, ne, ed, dy. The result is therefore (2 x 5)/(6 + 6) = 10/12, approximately 0.83, whereas Jaccard (intersection over union) would give 5/7, approximately 0.71.


- Jaro [13, 12] - this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 (m/|s1| + m/|s2| + (m - t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

- Jaro-Winkler [45] - an extension of the Jaro metric. This distance is given by:

dw = dj + l p (1 - dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or misspellings such as 'Martha' vs 'Marhta'.

- Needleman-Wunsch [30] - a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 used by Levenshtein):

D(i, j) = min{ D(i-1, j-1) + G,  D(i-1, j) + G,  D(i, j-1) + G }    (19)

The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

- Soundex [36] - with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

- Smith-Waterman [40] - a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximizes the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.


Then, the LogSim metric by Cordeiro and his team, described in the related work (Section 2.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm - 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) - in order to evaluate how much the threshold affects the performance of each metric.
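To make the interplay between the metric and the threshold concrete, the sketch below shows a simplified single-link clustering loop with a pluggable distance: a new answer joins (and merges) every existing cluster that has at least one member within the threshold. The interface name and the incremental formulation are assumptions for this illustration; the actual clusterer already exists in JustAsk.

    import java.util.ArrayList;
    import java.util.List;

    interface StringDistance {            // 0 = identical, 1 = completely different
        double distance(String a, String b);
    }

    class SingleLinkClusterer {
        static List<List<String>> cluster(List<String> answers, StringDistance d, double threshold) {
            List<List<String>> clusters = new ArrayList<>();
            for (String a : answers) {
                List<String> merged = new ArrayList<>();
                merged.add(a);
                // merge every cluster that has at least one member within the threshold
                clusters.removeIf(c -> {
                    boolean close = c.stream().anyMatch(m -> d.distance(a, m) <= threshold);
                    if (close) merged.addAll(c);
                    return close;
                });
                clusters.add(merged);
            }
            return clusters;
        }

        public static void main(String[] args) {
            // trivial case-insensitive metric, just for the demo
            StringDistance demo = (a, b) -> a.equalsIgnoreCase(b) ? 0.0 : 1.0;
            System.out.println(cluster(List.of("Grozny", "GROZNY", "Lisbon"), demo, 0.17));
        }
    }

Lowering the threshold makes clusters purer but more fragmented; raising it merges more aggressively, which explains why each metric reacts differently to the three values tested.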

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall - even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold    0.17                        0.33                        0.5
Levenshtein           P 51.1  R 87.0  A 44.5      P 43.7  R 87.5  A 38.5      P 37.3  R 84.5  A 31.5
Overlap               P 37.9  R 87.0  A 33.0      P 33.0  R 87.0  A 33.0      P 36.4  R 86.5  A 31.5
Jaccard index         P 9.8   R 56.0  A 5.5       P 9.9   R 55.5  A 5.5       P 9.9   R 55.5  A 5.5
Dice Coefficient      P 9.8   R 56.0  A 5.5       P 9.9   R 56.0  A 5.5       P 9.9   R 55.5  A 5.5
Jaro                  P 11.0  R 63.5  A 7.0       P 11.9  R 63.0  A 7.5       P 9.8   R 56.0  A 5.5
Jaro-Winkler          P 11.0  R 63.5  A 7.0       P 11.9  R 63.0  A 7.5       P 9.8   R 56.0  A 5.5
Needleman-Wunsch      P 43.7  R 87.0  A 38.0      P 43.7  R 87.0  A 38.0      P 16.9  R 59.0  A 10.0
LogSim                P 43.7  R 87.0  A 38.0      P 43.7  R 87.0  A 38.0      P 42.0  R 87.0  A 36.5
Soundex               P 35.9  R 83.5  A 30.0      P 35.9  R 83.5  A 30.0      P 28.7  R 82.0  A 23.5
Smith-Waterman        P 17.3  R 63.5  A 11.0      P 13.4  R 59.5  A 8.0       P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are obtained with Levenshtein at threshold 0.17.

5.2 New Distance Metric - 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J F Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens left unmatched on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 - still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J' and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.

A few flaws of this measure were identified later on, namely regarding its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
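A possible implementation of the measure, following equations (20)-(22), is sketched below. Tokenisation is simplified and the exact way unmatched tokens are counted (here, relative to the longer answer) is an assumption, so this should be read as one plausible reading of the description rather than the actual JustAsk code.

    import java.util.Arrays;
    import java.util.List;

    class JFKDistance {
        static double distance(String a1, String a2) {
            List<String> t1 = Arrays.asList(a1.toLowerCase().split("\\s+"));
            List<String> t2 = Arrays.asList(a2.toLowerCase().split("\\s+"));

            int commonTerms = 0;
            for (String t : t1) if (t2.contains(t)) commonTerms++;

            // tokens of the longer answer left without a perfect match (assumption)
            int nonCommonTerms = Math.max(t1.size(), t2.size()) - commonTerms;

            // unmatched tokens of a1 whose first letter matches an unmatched token of a2
            int firstLetterMatches = 0;
            for (String t : t1)
                if (!t.isEmpty() && !t2.contains(t))
                    for (String u : t2)
                        if (!u.isEmpty() && !t1.contains(u) && u.charAt(0) == t.charAt(0)) {
                            firstLetterMatches++;
                            break;
                        }

            double commonTermsScore = (double) commonTerms / Math.max(t1.size(), t2.size());
            double firstLetterMatchesScore = nonCommonTerms == 0 ? 0
                    : ((double) firstLetterMatches / nonCommonTerms) / 2.0;
            return 1 - commonTermsScore - firstLetterMatchesScore;
        }

        public static void main(String[] args) {
            // 'J F Kennedy' vs 'John Fitzgerald Kennedy': 1 - 1/3 - (2/2)/2 = 0.1666...
            System.out.println(distance("J F Kennedy", "John Fitzgerald Kennedy"));
        }
    }

Under this reading, the motivating example yields a distance of roughly 0.17, low enough to fall within the clustering thresholds discussed earlier, which is exactly the behaviour the metric was designed for.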


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                   200 questions                     1210 questions
Before             P: 51.1%  R: 87.0%  A: 44.5%      P: 28.2%  R: 79.8%  A: 22.5%
After new metric   P: 51.7%  R: 87.0%  A: 45.0%      P: 29.2%  R: 79.8%  A: 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features - Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric that offered the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for too much), Table 10 shows the results for every metric incorporated, for the three main categories - Human, Location and Numeric - that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).

Metric / Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index        7.58     0.00       0.00
Dice                 7.58     0.00       0.00
Jaro                 9.09     0.00       0.00
Jaro-Winkler         9.09     0.00       0.00
Needleman-Wunsch     42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman       9.09     2.56       0.00
JFKDistance          54.55    53.85      4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. The best result per category is achieved by JFKDistance, tied with other metrics for Location and Numeric.

We can easily notice that all of these metrics behave in similar ways across categories, i.e. results are always 'proportional' in each category to how they perform overall. That stops us from picking the best metric for each category, since we find that the newly implemented distance equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

- Time-changing answers - some questions do not have a single, immutable answer; the correct answer changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

- Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to a) ignore 'one' to 'three' when written in words, b) somehow exclude the publishing date from the snippet, and c) improve the regular expressions used to filter candidates.

- The filtering stage does not seem to work for certain questions - in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

- Answers in formats different from the expected - some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we list the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What should we do in this case?

- Difficulty tracing certain types of answer (category confusion) - for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

- Incomplete references - the most easily correctable error is in the reference file, which presents all the possible 'correct' answers but is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

- We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways of improving JustAsk's results within the Answer Extraction module.

- We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

- We developed new metrics to improve the clustering of answers.

- We developed new answer normalizers to relate two equal answers in different formats.

- We developed new corpora that we believe will be extremely useful in the future.

- We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1% using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk - what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

- Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

- Use the rules as a distance metric of their own.

- Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805-810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25-32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16-23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297-302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18-26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511-525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241-272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491-498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414-420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19-33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 - short papers - Volume 2, NAACL-Short '03, pages 46-48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143-150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211-240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265-283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707-710, 1966. Doklady Akademii Nauk SSSR, V163, No. 4, 845-848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1-7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112-120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471-486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371-391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775-780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39-41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311-318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18-26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448-453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa - Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513-523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa - Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just ask - A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491-502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1-13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation - LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354-359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133-138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160-166, Sydney, Australia, 2005.

Appendix

Appendix A - Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

- Ai,j - the j-th token of candidate answer Ai;

- Ai,j,k - the k-th char of the j-th token of candidate answer Ai;

- lemma(X) - the lemma of X;

- ishypernym(X, Y) - returns true if X is a hypernym of Y;

- length(X) - the number of tokens of X;

- contains(X, Y) - returns true if X contains Y;

- containsic(X, Y) - returns true if X contains Y, case insensitive;

- equals(X, Y) - returns true if X is lexicographically equal to Y;

- equalsic(X, Y) - returns true if X is lexicographically equal to Y, case insensitive;

- matches(X, RegExp) - returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if:

- contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  Example: 'M01 Y2001' includes 'D01 M01 Y2001'

- contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  Example: 'Y2001' includes 'D01 M01 Y2001'


1.2 Questions of category Numeric

Given:

- number = (\d+([0-9E]))

- lessthan = ((less than)|(almost)|(up to)|(nearly))

- morethan = ((more than)|(over)|(at least))

- approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and:

- max = 1.1 (or another value)

- min = 0.9 (or another value)

- numX - the numeric value in X

- numXmax = numX * max

- numXmin = numX * min

Ai is equivalent to Aj if:

- (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  Example: '200 people' is equivalent to '200'

- matchesic(Ai, "(.*)\bnumber( .*)") and matchesic(Aj, "(.*)\bnumber( .*)") and (numAimax > numAj) and (numAimin < numAj)
  Example: '210 people' is equivalent to '200 people'

Ai includes Aj if:

- (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)")
  Example: 'approximately 200 people' includes '210'

- (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)") and (numAi = numAj)
  Example: 'approximately 200 people' includes '200'

- (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)") and (numAjmax > numAi) and (numAjmin < numAi)
  Example: 'approximately 200 people' includes '210'


- (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number( .*)") and (numAi < numAj)
  Example: 'more than 200' includes '300'

- (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number( .*)") and (numAi > numAj)
  Example: 'less than 300' includes '200'

1.3 Questions of category Human

Ai is equivalent to Aj if:

- (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  Example: 'John F Kennedy' is equivalent to 'John Kennedy'

- (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  Example: 'Mr Kennedy' is equivalent to 'John Kennedy'

- (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  Example: 'John Fitzgerald Kennedy' is equivalent to 'John F Kennedy'

- (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  Example: 'John F Kennedy' is equivalent to 'F Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

- (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  Example: 'Mr Kennedy' is equivalent to 'Kennedy'

- containsic(Ai, Aj)

- levenshteinDistance(Ai, Aj) < 1

1.4 Questions of category Location

Ai is equivalent to Aj if:

- (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  Example: 'Republic of China' is equivalent to 'China'

- (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")

- (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0")
  Example: 'in China' is equivalent to 'China'

- (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  Example: 'in the lovely Republic of China' is equivalent to 'China'

- (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")

Ai includes Aj if:

- (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  Example: 'Florida' includes 'Miami, Florida'

- (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  Example: 'California' includes 'Los Angeles, California'

- (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if:

- (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if:

- (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  Example: 'animal' includes 'monkey'
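As an illustration of how one of these clauses might be encoded in code - here, the Human rule that makes 'John F Kennedy' equivalent to 'John Kennedy' - consider the sketch below. It is an assumption about the implementation style only, not the actual rule code used in JustAsk.

    class HumanEquivalenceRule {
        // 'John F Kennedy' (3 tokens) vs 'John Kennedy' (2 tokens):
        // equivalent when the first and last tokens match, case-insensitively.
        static boolean equivalent(String a1, String a2) {
            String[] t1 = a1.split("\\s+");
            String[] t2 = a2.split("\\s+");
            if (t1.length == 3 && t2.length == 2)
                return t1[0].equalsIgnoreCase(t2[0]) && t1[2].equalsIgnoreCase(t2[1]);
            if (t1.length == 2 && t2.length == 3)
                return equivalent(a2, a1);
            return false;
        }

        public static void main(String[] args) {
            System.out.println(equivalent("John F Kennedy", "John Kennedy")); // true
        }
    }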



Keywords

JustAskQuestion AnsweringRelations between AnswersParaphrase IdentificationClusteringNormalizationCorpusEvaluationAnswer Extraction

6

Index

1 Introduction
2 Related Work
  2.1 Relating Answers
  2.2 Paraphrase Identification
    2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
    2.2.2 Methodologies purely based on lexical similarity
    2.2.3 Canonicalization as complement to lexical features
    2.2.4 Use of context to extract similar phrases
    2.2.5 Use of dissimilarities and semantics to provide better similarity measures
  2.3 Overview
3 JustAsk architecture and work points
  3.1 JustAsk
    3.1.1 Question Interpretation
    3.1.2 Passage Retrieval
    3.1.3 Answer Extraction
  3.2 Experimental Setup
    3.2.1 Evaluation measures
    3.2.2 Multisix corpus
    3.2.3 TREC corpus
  3.3 Baseline
    3.3.1 Results with Multisix Corpus
    3.3.2 Results with TREC Corpus
  3.4 Error Analysis over the Answer Module
4 Relating Answers
  4.1 Improving the Answer Extraction module: Relations Module Implementation
  4.2 Additional Features
    4.2.1 Normalizers
    4.2.2 Proximity Measure
  4.3 Results
  4.4 Results Analysis
  4.5 Building an Evaluation Corpus
5 Clustering Answers
  5.1 Testing Different Similarity Metrics
    5.1.1 Clusterer Metrics
    5.1.2 Results
    5.1.3 Results Analysis
  5.2 New Distance Metric – 'JFKDistance'
    5.2.1 Description
    5.2.2 Results
    5.2.3 Results Analysis
  5.3 Possible new features – Combination of distance metrics to enhance overall results
  5.4 Posterior analysis
6 Conclusions and Future Work

List of Figures

1 JustAsk architecture, mirroring the architecture of a general QA system
2 How the software to create the 'answers corpus' works
3 Execution of the application to create answers corpora in the command line
4 Proposed approach to automate the detection of relations between answers
5 Relations module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion
6 Execution of the application to create relations corpora in the command line

List of Tables

1 JustAsk baseline best results for the test corpus of 200 questions
2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages
3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google
4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category
5 Baseline results for the three question sets created using TREC questions and the New York Times corpus
6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus
7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC
8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold
9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC
10 Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except when all the metrics give the same result

Acronyms

TREC – Text REtrieval Conference
QA – Question Answering
POS – Part of Speech

1 Introduction

QA systems have been with us since the second half of the 20th century, but it was with the explosion of the World Wide Web that they truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth it – after all, most of the time we just want a single, specific 'correct' answer instead of a set of documents deemed relevant by the engine we are using. Finding the most 'correct' answer for a given question should be the main purpose of a QA system.

This led us to the development of JustAsk [39, 38], a QA system whose mission is to retrieve, as any other QA system, answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as effective as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears more often, in more documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.

In summary, the goals of this work are:

– to survey the state of the art on relations between answers and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– to improve the Answer Extraction module by finding and implementing a proper methodology and tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– to test new clustering measures;

– to create corpora and software tools to evaluate our work.

This document is structured in the following way: Section 2 describes related work on relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done in the clustering stage and the results achieved; finally, in Section 6 we present the main conclusions and future work.

2 Related Work

As previously stated, the focus of this work is to find a better way to capture relations of equivalence, from notational variations to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalence if we extend the concept enough), we then discuss the different techniques for detecting paraphrases described in the literature.

2.1 Relating Answers

As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency counts and on the number and type of relationships between the pairs present in the relevant passages.

Following the terminology first proposed by Webber and his team [44] and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – the answers are consistent and mutually entailing, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter');

Inclusion – the answers are consistent and differ in specificity, one entailing the other. The relation can be one of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers);

Alternative – the answers do not entail each other at all; they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher');

Contradictory – the answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.

The two measures that will be mentioned in Section 3.1.3 to cluster answers are still insufficient for good clustering in JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.

2.2 Paraphrase Identification

According to WordReference.com 1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) to express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.

Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on referred to as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages we present an overview of approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.

2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora

We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization by a team led by Vasileios Hatzivassiloglou that allowed the clustering of highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:

1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether each pair is a paraphrase or not. It is used in a considerable amount of work in paraphrase identification.

– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words but also words that match on their stem – the most basic form of a word, to which affixes are attached;

– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.

Already in its first version, the system showed better results than TF-IDF [37], the classic information retrieval method. Some improvements were made later on, by a larger team that included Regina Barzilay, a very prominent researcher in the field of paraphrase identification, namely the introduction of a quantitative evaluation measure.

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many N-grams overlap in two sentences (usually with N ≤ 4) – in other words, how many common sequences of words we can find between the two sentences. The following formula gives the Word Simple N-gram Overlap measure:

NGsim(S1, S2) = (1/N) * Σ_{n=1}^{N} [ Count_match(n-gram) / Count(n-gram) ]    (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
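As an illustration, the following minimal Python sketch implements formula (1) under simple assumptions (whitespace tokenization, n-grams compared as sets); the function names are ours and do not come from the cited work.

def ngrams(tokens, n):
    # all contiguous n-grams of a token list, as a set
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(s1, s2, max_n=4):
    t1, t2 = s1.lower().split(), s2.lower().split()
    total = 0.0
    for n in range(1, max_n + 1):
        # maximum number of n-grams in the shorter sentence: Count(n-gram)
        shorter = min(len(t1), len(t2)) - n + 1
        if shorter <= 0:
            continue
        matches = len(ngrams(t1, n) & ngrams(t2, n))   # Count_match(n-gram)
        total += matches / shorter
    return total / max_n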

A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edit operations required to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function, described in [31], that measures how well an algorithm can align words in a corpus by combining precision and recall – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer but also less parallel data.

In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

L(x, λ) = −log2(λ/x)    (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are way too similar). The complete formula is written below:

logSim(S1, S2) = L(x, λ), if L(x, λ) < 1.0;  e^(−k·L(x, λ)), otherwise    (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, a metric depending only on x, and not on y, i.e., the number of words of the shortest sentence, is not a perfect measure – after all, the two are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y. In this version, L(x, y, λ) is computed using weighting factors α and β (with β = 1 − α) applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs in those two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was part of the test corpus.

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs in which one sentence is a compressed version of the other.
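A minimal Python sketch of the LogSim computation, following formulas (2) and (3) exactly as transcribed above; counting links as common word types is our own simplifying assumption.

import math

def logsim(s1, s2, k=2.7):
    w1, w2 = s1.lower().split(), s2.lower().split()
    links = len(set(w1) & set(w2))      # lambda: lexical links between S1 and S2
    x = max(len(w1), len(w2))           # number of words of the longest sentence
    if links == 0:
        return 0.0                      # no links: L tends to infinity, e^(-k*L) tends to 0
    l = -math.log2(links / x)           # L(x, lambda), formula (2)
    return l if l < 1.0 else math.exp(-k * l)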


2.2.3 Canonicalization as complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves, in particular, number entities (dates, times, monetary values, percentages), passive-to-active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method and concluding that the MSR has many nominalizations and not enough variety – which leaves out a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive-to-active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question. Two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information yet share little lexical resemblance, and for those the usual lexical similarity measures mentioned above are not enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only the easy paraphrases were identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S1, S2) = cos(S1, S2) − mismatch_penalty    (4)

with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming recurrence – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:

s(S1, S2) = max of:
  s(S1, S2−1) − skip_penalty
  s(S1−1, S2) − skip_penalty
  s(S1−1, S2−1) + sim(S1, S2)
  s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1)
  s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2)
  s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2)    (5)

with the skip penalty ensuring that only sentence pairs with sufficiently high similarity get aligned.
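A rough Python sketch of recurrence (5), assuming a precomputed sentence-to-sentence similarity function sim(i, j) (cosine minus the mismatch penalty of formula (4)) over 1-based sentence indices; the boundary values and the penalty value are illustrative choices of ours.

def align_score(sim, n1, n2, skip_penalty=0.1):
    # s[i][j] holds the best alignment weight of the first i sentences of one
    # text against the first j sentences of the other (formula (5))
    s = [[0.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            candidates = [
                s[i][j - 1] - skip_penalty,        # skip a sentence of text 2
                s[i - 1][j] - skip_penalty,        # skip a sentence of text 1
                s[i - 1][j - 1] + sim(i, j),       # one-to-one alignment
            ]
            if j >= 2:
                candidates.append(s[i - 1][j - 2] + sim(i, j) + sim(i, j - 1))
            if i >= 2:
                candidates.append(s[i - 2][j - 1] + sim(i, j) + sim(i - 1, j))
            if i >= 2 and j >= 2:
                candidates.append(s[i - 2][j - 2] + sim(i, j - 1) + sim(i - 1, j))
            s[i][j] = max(candidates)
    return s[n1][n2]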

Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer was extracted. This work has also been included here because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexically based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier in order to determine whether we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than to one between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus 4, a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = 1/2 * ( Σ_{w∈T1} maxSim(w, T2)·idf(w) / Σ_{w∈T1} idf(w) + Σ_{w∈T2} maxSim(w, T1)·idf(w) / Σ_{w∈T2} idf(w) )    (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxSim(w, T2), then apply the specificity via IDF, sum these values and normalize the result with the length of the text segment in question. The same process is applied to T2, and in the end we take the average between the two directions. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22], and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics:

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates the degree of dependence between them;

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix has passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage;

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy;

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up);

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal';

4 http://www.natcorp.ox.ac.uk

– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content;

– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input;

– Jcn – uses the same information as Lin, but combined in a different way.
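To show how these word-level scores are aggregated, here is a small Python sketch of formula (6); maxsim (a word-to-text similarity backed by any of the eight metrics above) and idf are assumed to be supplied by the caller, and all names are ours.

def text_similarity(t1, t2, maxsim, idf):
    # t1, t2: lists of words; maxsim(w, text) and idf(w) are provided externally
    def directed(src, dst):
        num = sum(maxsim(w, dst) * idf(w) for w in src)
        den = sum(idf(w) for w in src)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))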

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide whether one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion;

– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domains resource;

– Cosine measure – using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning, and therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with each element equal to 1 if the corresponding word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, a and b, and the similarity is calculated using the following formula:

sim(a, b) = (a · W · b) / (|a| |b|)    (7)

with W being a semantic similarity matrix containing the information about the similarity level of word pairs – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than the approaches of Mihalcea, Qiu or Zhang, and five also with a better precision rating than those three.
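A compact sketch of formula (7) using numpy; the vocabulary and the WordNet-based matrix W are assumed to be built elsewhere, and W is taken to hold, in each cell, the similarity of a pair of vocabulary words.

import numpy as np

def matrix_similarity(s1_tokens, s2_tokens, vocab, W):
    # binary occurrence vectors over a fixed vocabulary
    a = np.array([1.0 if w in s1_tokens else 0.0 for w in vocab])
    b = np.array([1.0 if w in s2_tokens else 0.0 for w in vocab])
    return float(a @ W @ b) / (np.linalg.norm(a) * np.linalg.norm(b))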

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting considers two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared with the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential to be successfully tested on our system.

There are two basic lines of work here: the use of lexical features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these there were several interesting ideas, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score into the paraphrase identification formula.

For all of these techniques we also saw different corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two plus negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis is done on JustAsk [39], a QA system developed at INESC-ID 5 that combines rule-based components with machine learning-based ones. Over this chapter we describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about the question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

5 http://www.inesc-id.pt

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities are used to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of that article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the variation among answers, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
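As an illustration of the kind of normalization described above, here is a minimal Python sketch handling just the two date patterns of the example; it is not JustAsk's actual normalizer, and only these two surface forms are covered.

import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ['january', 'february', 'march', 'april', 'may', 'june', 'july',
     'august', 'september', 'october', 'november', 'december'])}

def normalize_date(text):
    t = text.lower().replace(',', '')
    m = re.match(r'(\w+) (\d{1,2}) (\d{4})$', t)        # e.g. 'March 21, 1961'
    if m and m.group(1) in MONTHS:
        return 'D%d M%d Y%s' % (int(m.group(2)), MONTHS[m.group(1)], m.group(3))
    m = re.match(r'(\d{1,2}) (\w+) (\d{4})$', t)        # e.g. '21 March 1961'
    if m and m.group(2) in MONTHS:
        return 'D%d M%d Y%s' % (int(m.group(1)), MONTHS[m.group(2)], m.group(3))
    return text                                          # unrecognized: leave unchanged

# normalize_date('March 21, 1961') == normalize_date('21 March 1961') == 'D21 M3 Y1961'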

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned in Section 2.2.2, the Levenshtein distance computes the minimum number of edit operations required to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the two clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its size.
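For concreteness, here is a minimal Python sketch of the overlap distance in formula (9) and of a single-link clustering loop of the kind described above; the threshold and the tokenization are illustrative, and this is not JustAsk's actual implementation.

def overlap_distance(x, y):
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(tx & ty) / min(len(tx), len(ty))

def single_link(answers, distance, threshold):
    clusters = [[a] for a in answers]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: cluster distance is the minimum pairwise distance
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:          # no pair under the threshold: stop
            return clusters
        _, i, j = best
        clusters[i].extend(clusters.pop(j))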

3.2 Experimental Setup

Over this section we describe the corpora used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluating QA systems [43], are used:

Precision:

  Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

  Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

  F = 2 * (precision * recall) / (precision + recall)    (12)

Accuracy:

  Accuracy = Correctly judged items / Items in test corpus    (13)

Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

  MRR = (1/N) * Σ_{i=1}^{N} (1 / rank_i)    (14)
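The sketch below shows how these measures can be computed from a list of per-question judgements (one entry per test question: None when the system returned no answer, otherwise a pair with a correctness flag and the rank of the first correct answer); this data layout is an assumption of ours, not JustAsk's internal format.

def evaluate(judgements):
    answered = [j for j in judgements if j is not None]
    correct = sum(1 for is_correct, _ in answered if is_correct)
    precision = correct / len(answered) if answered else 0.0
    recall = len(answered) / len(judgements)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(judgements)
    # reciprocal rank counts as 0 for questions with no correct answer in the list
    mrr = sum(1.0 / rank for _, rank in answered if rank) / len(judgements)
    return precision, recall, f1, accuracy, mrr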

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also provides the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more precise), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered from the Multieight corpus [25] and used.

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages (which we call 'snippets') found at each of the two main search engines (Google and Yahoo). This process is quite exhausting in itself: the corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 x 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single reference (or multiple references). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing at most 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest ... capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word of the content that has no relation whatsoever to the actual answer; but it helps one find and write down correct answers, compared with the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially being used to construct new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line
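A rough Python sketch of the annotation loop just described, assuming one JSON object per snippet with 'question' and 'content' fields; the real tool works over JustAsk's own corpus format, so the field names and file layout here are purely illustrative.

import json

def annotate(snippets_path, output_path):
    references = {}
    with open(snippets_path) as f:
        for line in f:
            snippet = json.loads(line)
            question, content = snippet['question'], snippet['content']
            print('Q: %s\n%s' % (question, content))
            answer = input('reference (empty to skip)> ').strip()
            if answer and answer in content:   # accept only substrings of the passage
                references.setdefault(question, []).append(answer)
    with open(output_path, 'w') as out:
        json.dump(references, out, indent=2)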


3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we extracted the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly and what 'correct' answer it gives. A small application was created, taking the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of them are actually interconnected (some even losing their original reference after this filtering step) and therefore lose precious entity information by using pronouns, for example, which could partly explain the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself references the previous one, and so forth?

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
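A small Python sketch of this filtering step (the pronoun list mirrors the one above; everything else is an illustrative assumption):

import re

PRONOUNS = {'he', 'she', 'it', 'her', 'him', 'hers', 'his'}

def has_dangling_pronoun(question):
    tokens = re.findall(r"[a-z']+", question.lower())
    return any(t in PRONOUNS for t in tokens)

questions = ['When did she die?', 'How long is the Okavango River?']
kept = [q for q in questions if not has_dangling_pronoun(q)]   # keeps only the second question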

But much less obvious cases were still present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?

In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historical context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus carried topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after noticing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, and in as many variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/sentence, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not the whole documents but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, 50.3% of them correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.

Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions

In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one correct final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.

                    Google – 8 passages             Google – 32 passages
Extracted Answers   Questions  CAE      AS          Questions  CAE      AS
                               success  success                success  success
0                   21         0        –           19         0        –
1 to 10             73         62       53          16         11       11
11 to 20            32         29       22          13         9        7
21 to 40            11         10       8           61         57       39
41 to 60            1          1        1           30         25       17
61 to 100           0          –        –           18         14       11
> 100               0          –        –           5          4        3
All                 138        102      84          162        120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes; but if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted; when that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.


                    Google – 64 passages
Extracted Answers   Questions   CAE success   AS success
0                   17          0             0
1 to 10             18          9             9
11 to 20            2           2             2
21 to 40            16          12            8
41 to 60            42          38            25
61 to 100           38          32            18
> 100               35          30            17
All                 168         123           79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google

              Google – 8 passages      Google – 32 passages     Google – 64 passages
      Total   Q     CAE   AS           Q     CAE   AS            Q     CAE   AS
abb   4       3     1     1            3     1     1             4     1     1
des   9       6     4     4            6     4     4             6     4     4
ent   21      15    1     1            16    1     0             17    1     0
hum   64      46    42    34           55    48    36            56    48    31
loc   39      27    24    19           34    27    22            35    29    21
num   63      41    30    25           48    35    25            50    40    22
All   200     138   102   84           162   120   88            168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also shows the discrepancies between the different stages for all of these categories, indicating that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.

The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous question sets, as we can see in Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three question sets created using TREC questions and the New York Times corpus

Later, we carried out further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

            20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus

Even though 100 passages offer the best results, it also takes more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these settings is viable, with 30 passages offering the best middle ground.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the candidate answers, failing to group equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct one. Unfortunately, since there are only two instances of these answers, both share the same score of 1.0. Further analysis of the passages containing the respective answers helps reveal which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'

– 'Namibia has built a water canal measuring about 300 km long ... Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer, within a certain predefined threshold, to the keywords used in the search.

2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might write a small set of hand-crafted rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations and titles such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that measures similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than a single common link, as we have here ('Kennedy'), out of three words in each answer: LogSim gives approximately 0.014, while Levenshtein actually gives a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order (see the sketch at the end of this list). What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject the pair? The answer is likely to be 'yes'.


3. Places/dates/numbers that differ in specificity – some candidates in the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs 'Mount Kenya National Park'

'Bavaria' vs 'Oberbayern (Upper Bavaria)'

'2009' vs 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the line for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities are referred to by completely different names for the same referent: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs km vs feet, pounds vs kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs 'Khotashan'

'Grozny' and 'GROZNY'

Solution: the answers should simply be normalized into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to also cover acronyms.

In this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of a set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

Once this process is automated and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the final answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

Each answer now possesses a set of relations it shares with other answers – namely the type of relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.


Figure 5: Relations' module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
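To make this organization more concrete, a minimal sketch in Java of how such a Scorer interface and its two components could be laid out is given below. Apart from the names Scorer, BaselineScorer and RulesScorer and the weights mentioned above, all classes, fields and method names are illustrative assumptions rather than the actual JustAsk code.

import java.util.*;

enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

class Relation {
    final RelationType type;
    final Answer target;
    Relation(RelationType type, Answer target) { this.type = type; this.target = target; }
}

class Answer {
    final String text;
    double score;
    final List<Relation> relations = new ArrayList<>();
    Answer(String text) { this.text = text; }
}

interface Scorer {
    void score(List<Answer> candidates);
}

// Counts exact repetitions of the same answer string (frequency, weight 1.0 per repetition).
class BaselineScorer implements Scorer {
    public void score(List<Answer> candidates) {
        Map<String, Integer> freq = new HashMap<>();
        for (Answer a : candidates) freq.merge(a.text, 1, Integer::sum);
        for (Answer a : candidates) a.score += freq.get(a.text);
    }
}

// Boosts answers for each equivalence (1.0) and inclusion (0.5) relation they take part in.
class RulesScorer implements Scorer {
    public void score(List<Answer> candidates) {
        for (Answer a : candidates) {
            for (Relation r : a.relations) {
                if (r.type == RelationType.EQUIVALENCE) a.score += 1.0;
                else a.score += 0.5;
            }
        }
    }
}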


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
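As an illustration, a conversion normalizer for Numeric:Distance answers could look like the following minimal sketch. The class name, the exact unit list and the output format are assumptions for the example, not the actual JustAsk implementation.

public class DistanceNormalizer {
    // Rewrites a distance answer in kilometres; returns the input unchanged if no unit is found.
    public static String normalize(String answer) {
        java.util.regex.Matcher m = java.util.regex.Pattern
                .compile("([0-9]+(?:\\.[0-9]+)?)\\s*(km|kilometers?|kilometres?|miles?|mi|feet|ft)")
                .matcher(answer.toLowerCase());
        if (!m.find()) return answer;
        double value = Double.parseDouble(m.group(1));
        switch (m.group(2)) {
            case "mi": case "mile": case "miles": value *= 1.609344; break;   // miles to km
            case "feet": case "ft": value *= 0.0003048; break;                // feet to km
            default: break;                                                   // already in km
        }
        return String.format(java.util.Locale.US, "%.1f km", value);
    }

    public static void main(String[] args) {
        System.out.println(normalize("1000 miles"));   // 1609.3 km
        System.out.println(normalize("about 300 km")); // 300.0 km
    }
}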

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we found a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' – in this case, however, we can easily limit 'King' removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it reminds us that we must never, under any circumstances, underestimate the richness of a natural language.
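A possible sketch of this abbreviation stripping, restricting the removal to titles at the very beginning of the answer and keeping a list of exceptions, is shown below; the concrete title and exception lists are illustrative only.

import java.util.Arrays;
import java.util.List;

public class AbbreviationNormalizer {
    private static final List<String> TITLES =
            Arrays.asList("Mr.", "Mrs.", "Ms.", "King", "Queen", "Dame", "Sir", "Dr.");
    // Answers that must not be touched, such as bands or stage names.
    private static final List<String> EXCEPTIONS = Arrays.asList("Mr. T", "Queen");

    public static String normalize(String answer) {
        if (EXCEPTIONS.contains(answer)) return answer;
        for (String title : TITLES) {
            // Only strip the title when it occurs at the very beginning of the answer.
            if (answer.startsWith(title + " ")) return answer.substring(title.length() + 1);
        }
        return answer;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Queen Elizabeth")); // Elizabeth
        System.out.println(normalize("Mr. King"));        // King
        System.out.println(normalize("Queen"));           // Queen (exception kept intact)
    }
}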

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only handles very particular cases for the moment, allowing room for further extensions.

4.2.2 Proximity Measure

As for the problem in 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
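A rough sketch of this proximity boost follows; the method and constant names are assumptions, and only the threshold (20 characters) and boost (5 points) come from the description above.

import java.util.List;

public class ProximityBooster {
    private static final int THRESHOLD = 20;  // characters
    private static final double BOOST = 5.0;

    // Returns the (possibly boosted) score of an answer found inside a passage.
    public static double boostedScore(double score, String passage, String answer, List<String> keywords) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0 || keywords.isEmpty()) return score;
        double total = 0;
        int found = 0;
        for (String k : keywords) {
            int kPos = passage.indexOf(k);
            if (kPos >= 0) {
                total += Math.abs(kPos - answerPos);
                found++;
            }
        }
        if (found == 0) return score;
        double avgDistance = total / found;
        return avgDistance < THRESHOLD ? score + BOOST : score;
    }
}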


4.3 Results

                      200 questions              1210 questions
Before features       P 50.6  R 87.0  A 44.0     P 28.0  R 79.5  A 22.2
After features        P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in that case a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As it happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt about the relation type.

If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
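A minimal sketch of such an annotation loop is given below; the I/O details, method names and exact output wording are illustrative assumptions and may differ from the real application.

import java.io.PrintWriter;
import java.util.*;

public class RelationAnnotator {
    private static final Set<String> VALID = new HashSet<>(
            Arrays.asList("E", "I1", "I2", "A", "C", "D", "skip"));

    public static void annotate(Map<Integer, List<String>> references, PrintWriter out) {
        Scanner in = new Scanner(System.in);
        for (Map.Entry<Integer, List<String>> question : references.entrySet()) {
            List<String> refs = question.getValue();
            if (refs.size() < 2) continue;                  // single reference: nothing to relate
            for (int i = 0; i < refs.size(); i++) {
                for (int j = i + 1; j < refs.size(); j++) {
                    String judgment;
                    do {
                        System.out.printf("Q%d: '%s' vs '%s' [E/I1/I2/A/C/D/skip]? ",
                                question.getKey(), refs.get(i), refs.get(j));
                        judgment = in.nextLine().trim();
                    } while (!VALID.contains(judgment));    // repeat until a valid option is typed
                    if (!judgment.equals("skip"))
                        out.printf("QUESTION %d: %s - %s AND %s%n",
                                question.getKey(), label(judgment), refs.get(i), refs.get(j));
                }
            }
        }
        out.flush();
    }

    private static String label(String j) {
        switch (j) {
            case "E": return "EQUIVALENCE";
            case "I1": case "I2": return "INCLUSION";
            case "A": return "COMPLEMENTARY ANSWERS";
            case "C": return "CONTRADICTORY ANSWERS";
            default: return "IN DOUBT";
        }
    }
}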

Figure 6: Execution of the application to create relations' corpora in the command line.

In the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined by the size of the intersection divided by the size of the union of two sets (A and B):

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across both sequences.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's Coefficient will basically give double weight to the intersecting elements, roughly doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams for each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/(6 + 6 − 5) = 5/7 ≈ 0.71 (a short sketch after this list of metrics makes this arithmetic concrete).


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l × p × (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or misspellings such as 'Martha' vs 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 for Levenshtein):

D(i, j) = min{ D(i − 1, j − 1) + G, D(i − 1, j) + G, D(i, j − 1) + G }    (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the threshold proposed (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.
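The following self-contained sketch (not the SimMetrics implementation) reproduces the bigram arithmetic used in the Dice and Jaccard example above, for 'Kennedy' vs 'Keneddy'.

import java.util.HashSet;
import java.util.Set;

public class BigramSimilarity {
    // Builds the set of character bigrams of a string.
    static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++) result.add(s.substring(i, i + 2));
        return result;
    }

    public static void main(String[] args) {
        Set<String> a = bigrams("Kennedy");   // {Ke, en, nn, ne, ed, dy}
        Set<String> b = bigrams("Keneddy");   // {Ke, en, ne, ed, dd, dy}
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);                  // {Ke, en, ne, ed, dy} -> 5 elements
        double dice = 2.0 * common.size() / (a.size() + b.size());                      // 10/12
        double jaccard = (double) common.size() / (a.size() + b.size() - common.size()); // 5/7
        System.out.printf("Dice = %.2f, Jaccard = %.2f%n", dice, jaccard);
    }
}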


Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much this threshold affects the performance of each metric.

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold     0.17                     0.33                     0.5
Levenshtein            P 51.1* R 87.0 A 44.5*   P 43.7 R 87.5 A 38.5     P 37.3 R 84.5 A 31.5
Overlap                P 37.9 R 87.0 A 33.0     P 33.0 R 87.0 A 33.0     P 36.4 R 86.5 A 31.5
Jaccard index          P 9.8  R 56.0 A 5.5      P 9.9  R 55.5 A 5.5      P 9.9  R 55.5 A 5.5
Dice Coefficient       P 9.8  R 56.0 A 5.5      P 9.9  R 56.0 A 5.5      P 9.9  R 55.5 A 5.5
Jaro                   P 11.0 R 63.5 A 7.0      P 11.9 R 63.0 A 7.5      P 9.8  R 56.0 A 5.5
Jaro-Winkler           P 11.0 R 63.5 A 7.0      P 11.9 R 63.0 A 7.5      P 9.8  R 56.0 A 5.5
Needleman-Wunch        P 43.7 R 87.0 A 38.0     P 43.7 R 87.0 A 38.0     P 16.9 R 59.0 A 10.0
LogSim                 P 43.7 R 87.0 A 38.0     P 43.7 R 87.0 A 38.0     P 42.0 R 87.0 A 36.5
Soundex                P 35.9 R 83.5 A 30.0     P 35.9 R 83.5 A 30.0     P 28.7 R 82.0 A 23.5
Smith-Waterman         P 17.3 R 63.5 A 11.0     P 13.4 R 59.5 A 8.0      P 10.4 R 57.5 A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system marked with an asterisk.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, while adding an extra weight for strings that not only share at least one perfect match but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens left unmatched on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the first-letter matches of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
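A direct sketch of equations (20)-(22) follows; the whitespace tokenization, the in-order matching of first letters and the prior normalization step are simplified assumptions here, not necessarily the exact JustAsk implementation.

import java.util.*;

public class JFKDistance {
    public static double distance(String a1, String a2) {
        List<String> t1 = Arrays.asList(a1.split("\\s+"));
        List<String> t2 = Arrays.asList(a2.split("\\s+"));

        int commonTerms = 0;
        for (String tok : t1) if (t2.contains(tok)) commonTerms++;

        // Tokens left unmatched on both sides ('J.', 'John', 'F.', 'Fitzgerald' in the example).
        int nonCommonTerms = (t1.size() - commonTerms) + (t2.size() - commonTerms);

        // First-letter matches between the unmatched tokens, taken in order.
        List<String> rest1 = new ArrayList<>(t1);
        List<String> rest2 = new ArrayList<>(t2);
        rest1.removeAll(t2);
        rest2.removeAll(t1);
        int firstLetterMatches = 0;
        for (int i = 0; i < Math.min(rest1.size(), rest2.size()); i++)
            if (rest1.get(i).charAt(0) == rest2.get(i).charAt(0)) firstLetterMatches++;

        double commonTermsScore = (double) commonTerms / Math.max(t1.size(), t2.size());
        double firstLetterMatchesScore = nonCommonTerms == 0
                ? 0 : ((double) firstLetterMatches / nonCommonTerms) / 2;
        return 1 - commonTermsScore - firstLetterMatchesScore;
    }

    public static void main(String[] args) {
        // 1 - 1/3 - (2/4)/2 = 0.417 for the example above
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}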

A few flaws of this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                      200 questions              1210 questions
Before                P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5
After new metric      P 51.7  R 87.0  A 45.0     P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues that are common to every single distance metric used (so they should not affect the analysis we are looking for too much), here are the results for every metric incorporated, for the three main categories Human, Location and Numeric, which can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric / Category      Human     Location    Numeric
Levenshtein            50.00     53.85*      4.84
Overlap                42.42     41.03       4.84
Jaccard index          7.58      0.00        0.00
Dice                   7.58      0.00        0.00
Jaro                   9.09      0.00        0.00
Jaro-Winkler           9.09      0.00        0.00
Needleman-Wunch        42.42     46.15       4.84
LogSim                 42.42     46.15       4.84
Soundex                42.42     46.15       3.23
Smith-Waterman         9.09      2.56        0.00
JFKDistance            54.55*    53.85*      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category marked with an asterisk, except where several metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking the best metric for each category, since we find that the newly implemented distance equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are particularly error-prone, due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error is the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art on relations between answers and paraphrase identification, to detect new possible ways of improving JustAsk's results in the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine De Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning, 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Languages, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845-848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering, 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)], 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is the hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001

1.2 Questions of category Numeric

Given:

• number = (\d+([0-9,.E])*)

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numX,max = numX × max

• numX,min = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAi,max > numAj) and (numAi,min < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAj,max > numAi) and (numAj,min < numAi)
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  – Example: less than 300 includes 200
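As an illustration, the numeric equivalence test above (based on the max and min factors) could be implemented along the lines of the following sketch; the regular expression and method names are assumptions, not the rules' actual implementation.

import java.util.regex.*;

public class NumericRules {
    private static final double MAX = 1.1;
    private static final double MIN = 0.9;
    private static final Pattern NUMBER = Pattern.compile("\\d+(\\.\\d+)?");

    // Extracts the first numeric value found in an answer, or null if none exists.
    static Double numericValue(String answer) {
        Matcher m = NUMBER.matcher(answer.replace(",", ""));
        return m.find() ? Double.valueOf(m.group()) : null;
    }

    // numAi,max > numAj and numAi,min < numAj, as in the equivalence rule above.
    static boolean equivalent(String ai, String aj) {
        Double ni = numericValue(ai), nj = numericValue(aj);
        if (ni == null || nj == null) return false;
        return ni * MAX > nj && ni * MIN < nj;
    }

    public static void main(String[] args) {
        System.out.println(equivalent("210 people", "200 people")); // true
        System.out.println(equivalent("300 km", "200 km"));         // false
    }
}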

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to F. Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey

Index

1 Introduction 12

2 Related Work 1421 Relating Answers 1422 Paraphrase Identification 15

221 SimFinder ndash A methodology based on linguistic analysisof comparable corpora 15

222 Methodologies purely based on lexical similarity 16223 Canonicalization as complement to lexical features 18224 Use of context to extract similar phrases 18225 Use of dissimilarities and semantics to provide better sim-

ilarity measures 1923 Overview 22

3 JustAsk architecture and work points 2431 JustAsk 24

311 Question Interpretation 24312 Passage Retrieval 25313 Answer Extraction 25

32 Experimental Setup 26321 Evaluation measures 26322 Multisix corpus 27323 TREC corpus 30

33 Baseline 31331 Results with Multisix Corpus 31332 Results with TREC Corpus 33

34 Error Analysis over the Answer Module 34

4 Relating Answers 3841 Improving the Answer Extraction module

Relations Module Implementation 3842 Additional Features 41

421 Normalizers 41422 Proximity Measure 41

43 Results 4244 Results Analysis 4245 Building an Evaluation Corpus 42

5 Clustering Answers 4451 Testing Different Similarity Metrics 44

511 Clusterer Metrics 44512 Results 46513 Results Analysis 47

52 New Distance Metric - rsquoJFKDistancersquo 48

7

521 Description 48522 Results 50523 Results Analysis 50

53 Possible new features ndash Combination of distance metrics to en-hance overall results 50

54 Posterior analysis 51

6 Conclusions and Future Work 53

8

List of Figures

1 JustAsk architecture mirroring the architecture of a general QAsystem 24

2 How the software to create the lsquoanswers corpusrsquo works 283 Execution of the application to create answers corpora in the

command line 294 Proposed approach to automate the detection of relations be-

tween answers 385 Relationsrsquo module basic architecture Each answer has now a field

representing the set of relations it shares with other answers Theinterface Scorer made to boost the score of the answers sharinga particular relation has two components Baseline Scorer tocount the frequency and Rules Scorer to count relations of equiv-alence and inclusion 40

6 Execution of the application to create relationsrsquo corpora in thecommand line 43

9

List of Tables

1 JustAsk baseline best results for the corpus test of 200 questions 322 Answer extraction intrinsic evaluation using Google and two dif-

ferent amounts of retrieved passages 323 Answer extraction intrinsic evaluation using 64 passages from

the search engine Google 334 Answer extraction intrinsic evaluation for different amounts of

retrieved passages according to the question category 335 Baseline results for the three sets of corpus that were created

using TREC questions and New York Times corpus 346 Accuracy results after the implementation of the features de-

scribed in Sections 4 and 5 for 20 30 50 and 100 passages ex-tracted from the NYT corpus 34

7 Precision (P) Recall (R)and Accuracy (A) Results before andafter the implementation of the features described in 41 42 and43 Using 30 passages from New York Times for the corpus of1210 questions from TREC 42

8 Precision (P) Recall (R) and Accuracy (A) results for every sim-ilarity metric added for a given threshold value using the testcorpus of 200 questions and 32 passages from Google Best pre-cision and accuracy of the system bolded 47

9 Precision (P) Recall (R) and Accuracy (A) Results after theimplementation of the new distance metric Using 30 passagesfrom New York Times for the corpus of 1210 questions from TREC 50

10 Accuracy results for every similarity metric added for each cate-gory using the test corpus of 200 questions and 32 passages fromGoogle Best precision and accuracy of the system bolded exceptfor when all the metrics give the same result 51

10

Acronyms

TREC ndash Text REtrieval ConferenceQA ndash Question-AnsweringPOS ndash Part of Speech

11

1 Introduction

QA Systems have been with us since the second half of the 20th century Butit was with the world wide web explosion that QA systems truly broke outleaving restricted domains for global worldwide ones

With the exponential growth of information finding specific informationquickly became an issue Search engines were and still are a valid option butthe amount of the noisy information they generate may not be worth the shotndash after all most of the times we just want a specific single lsquocorrectrsquo answerinstead of a set of documents deemed relevant by the engine we are using Thetask of finding the most lsquocorrectrsquo answer for a given question should be the mainpurpose of a QA system after all

This led us to the development of JustAsk [39 38] a QA system with themission to retrieve (as any other QA system) answers from the web to a givenquestion JustAsk combines hand-crafted rules and machine-learning compo-nents and it is composed by the same modules that define a general QA ar-chitecture Question Interpretation Passage Retrieval and Answer ExtractionUnfortunately the Answer Extraction module is not as efficient as we wouldlike it to be

QA systems usually base the process of finding an answer using the principleof redundancy [22] ie they consider the lsquorightrsquo answer as the one that appearsmore often in more documents Since we are working with the web as ourcorpus we are dealing with the risk of several variations of the same answerbeing perceived as different answers when in fact they are only one and shouldbe counted as such To solve this problem we need ways to connect all of thesevariations

In resume the goals of this work are

ndash To perform a state of the art on answersrsquo relations and paraphrase iden-tification taking into special account the approaches that can be moreeasily adopted by a QA system such as JustAsk

ndash improve the Answer Extraction module by findingimplementing a propermethodologytools to correctly identify relations of equivalence and inclu-sion between answers through a relations module of sorts

ndash test new clustering measures

ndash create corpora and software tools to evaluate our work

This document is structured in the following way Section 2 describes specificrelated work in the subject of relations between answers and paraphrase iden-tification Section 3 details our point of departure ndash JustAsk system ndash and thetasks planned to improve its Answer Extraction module along with the baselineresults prior to this work Section 4 describes the main work done in relatinganwers and respective results Section 5 details the work done for the Clustering

12

stage and the results achieved and finally on Section 6 we present the mainconclusions and future work

13

2 Related Work

As previously stated the focus of this work is to find a better formula tocatch relations of equivalence from notational variances to paraphrases Firstwe identify the relations between answers that will be the target of our studyAnd since paraphrases are such a particular type of equivalences (and yet withthe potential to encompass all types of equivalences if we extend the conceptenough) we then talk about the different techniques to detect paraphrases de-scribed in the literature

2.1 Relating Answers

As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if it has a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency counts and on the number and type of relationships between pairs present in the relevant passages.

From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').

Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be relations of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – if answers do not entail each other at all, they constitute valid complementary answers, whether referring to the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.

The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like


'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.

2.2 Paraphrase Identification

According to WordReference.com1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.

Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus2, from now on mentioned as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages we present an overview of some approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.

2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora

We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed the clustering of highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:

1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.


– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.

The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many N-grams overlap in two sentences (usually with N ≤ 4); in other words, how many common sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:

NGsim(S_1, S_2) = \frac{1}{N} \sum_{n=1}^{N} \frac{Count\_match(n\text{-}gram)}{Count(n\text{-}gram)}    (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
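A minimal Python sketch of this measure (our own illustration, not code from the works cited; it simply skips n-gram orders longer than the shorter sentence) could look as follows:

    def ngrams(tokens, n):
        """All contiguous n-grams of a token list."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def ngram_overlap(s1, s2, max_n=4):
        """Word Simple N-gram Overlap (equation 1) between two tokenized
        sentences. Returns a value in [0, 1]."""
        shorter = s1 if len(s1) <= len(s2) else s2
        total = 0.0
        for n in range(1, max_n + 1):
            denom = len(ngrams(shorter, n))        # max. n-grams in the shorter sentence
            if denom == 0:
                continue                           # shorter sentence has no n-grams of this order
            matches = len(set(ngrams(s1, n)) & set(ngrams(s2, n)))
            total += matches / denom
        return total / max_n

    print(ngram_overlap("Mary had a nice dinner".split(),
                        "Mary had a good dinner".split()))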

A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edit operations needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses. A string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.


In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

L(x, \lambda) = -\log_2(\lambda / x)    (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are too similar). The complete formula is written below:

logSim(S_1, S_2) = \begin{cases} L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\ e^{-k \cdot L(x, \lambda)} & \text{otherwise} \end{cases}    (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.

L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs offered by those two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
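A minimal sketch of the base LogSim measure of equations (2) and (3), assuming λ is approximated by the raw number of common words (Cordeiro et al. count exclusive lexical links, so function words could additionally be filtered out) and using k = 2.7 as in their experiments:

    import math

    def logsim(s1, s2, k=2.7):
        """Base LogSim (equations 2 and 3) for two tokenized sentences."""
        links = len(set(s1) & set(s2))    # lambda: common words (a rough stand-in for lexical links)
        x = max(len(s1), len(s2))         # length of the longest sentence
        if links == 0:
            return 0.0
        l = -math.log2(links / x)         # L(x, lambda)
        return l if l < 1.0 else math.exp(-k * l)   # second branch penalizes dissimilar pairs

    s1 = "Mary had a very nice dinner for her birthday".split()
    s2 = "It was a successful dinner party for Mary on her birthday".split()
    print(logsim(s1, s2))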


2.2.3 Canonicalization as a complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves particularly number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only the 'easy' paraphrases were identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S_1, S_2) = cos(S_1, S_2) - mismatch\_penalty    (4)


with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming formula – a way of breaking down a complex problem into a series of simpler subproblems:

s(S_1, S_2) = \max \begin{cases}
s(S_1, S_2 - 1) - skip\_penalty \\
s(S_1 - 1, S_2) - skip\_penalty \\
s(S_1 - 1, S_2 - 1) + sim(S_1, S_2) \\
s(S_1 - 1, S_2 - 2) + sim(S_1, S_2) + sim(S_1, S_2 - 1) \\
s(S_1 - 2, S_2 - 1) + sim(S_1, S_2) + sim(S_1 - 1, S_2) \\
s(S_1 - 2, S_2 - 2) + sim(S_1, S_2 - 1) + sim(S_1 - 1, S_2)
\end{cases}    (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.

Results were interesting: a 'weak' similarity measure, combined with contextual information, managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed, using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier, in order to determine whether we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given


by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)} \right)    (6)

For each word w in the segment T1, the goal is to identify the word that matches it the most in T2, given by maxSim(w, T2), then apply the specificity via IDF, sum these contributions, and normalize the result by the length of the text segment in question. The same process is applied to T2, and in the end we take the average of the two directions. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22], and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics (a small code sketch of the overall measure follows the list):

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most

4 http://www.natcorp.ox.ac.uk


specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.

– Lin – takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.
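As an illustration of equation (6), the sketch below (our own simplification, not Mihalcea's implementation) uses NLTK's WordNet interface with the Wu & Palmer metric playing the role of maxSim, and a toy IDF table; a real setting would estimate the IDF values from a large corpus such as the BNC, and the NLTK WordNet data must be installed beforehand.

    from nltk.corpus import wordnet as wn

    def max_sim(word, segment):
        """Highest Wu & Palmer similarity between 'word' and any word of 'segment'."""
        best = 0.0
        for other in segment:
            for s1 in wn.synsets(word):
                for s2 in wn.synsets(other):
                    best = max(best, s1.wup_similarity(s2) or 0.0)
        return best

    def text_similarity(t1, t2, idf):
        """Equation (6): averaged, IDF-weighted directional similarity of two segments."""
        def directed(a, b):
            num = sum(max_sim(w, b) * idf.get(w, 1.0) for w in a)
            den = sum(idf.get(w, 1.0) for w in a)
            return num / den if den else 0.0
        return 0.5 * (directed(t1, t2) + directed(t2, t1))

    idf = {'cat': 3.2, 'dog': 3.0, 'animal': 1.5}     # toy IDF values (assumed)
    print(text_similarity(['the', 'cat', 'sleeps'], ['a', 'dog', 'rests'], idf))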

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but also made between a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource.

– Cosine measure – by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning, and therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, \vec{a} and \vec{b}, and the similarity is calculated using the following formula:

sim(\vec{a}, \vec{b}) = \frac{\vec{a} \, W \, \vec{b}}{|\vec{a}| \, |\vec{b}|}    (7)


with W being a semantic similarity matrix containing the information about the similarity level of each word pair – in other words, a number between 0 and 1 for each element Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's, and five with also a better precision rating than those three approaches.

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S_1, S_2) = \frac{Sim(S_1, S_2)}{Diss(S_1, S_2)}    (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared with the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,


expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the one we found to have the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, we saw several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu Corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information

5 http://www.inesc-id.pt


about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions, like 'What is a parrot?', use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy: Wikipedia's search facilities locate the title of the article where the answer might be found, and DBpedia retrieves the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalization, clustering and filtering of the candidate answers, to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to


a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
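A sketch of this kind of date normalization (illustrative only; JustAsk's own normalizer may differ in coverage and implementation):

    import re

    MONTHS = {'january': 1, 'february': 2, 'march': 3, 'april': 4, 'may': 5, 'june': 6,
              'july': 7, 'august': 8, 'september': 9, 'october': 10, 'november': 11, 'december': 12}

    def normalize_date(answer):
        """Map 'March 21, 1961' or '21 March 1961' to the canonical 'D21 M3 Y1961'."""
        text = answer.lower().replace(',', '')
        match = re.match(r'(\w+) (\d{1,2}) (\d{4})', text)      # 'march 21 1961'
        if match and match.group(1) in MONTHS:
            return 'D%s M%d Y%s' % (match.group(2), MONTHS[match.group(1)], match.group(3))
        match = re.match(r'(\d{1,2}) (\w+) (\d{4})', text)      # '21 march 1961'
        if match and match.group(2) in MONTHS:
            return 'D%s M%d Y%s' % (match.group(1), MONTHS[match.group(2)], match.group(3))
        return answer                                           # leave unrecognized formats untouched

    print(normalize_date('March 21, 1961'))   # D21 M3 Y1961
    print(normalize_date('21 March 1961'))    # D21 M3 Y1961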

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 - \frac{|X \cap Y|}{\min(|X|, |Y|)}    (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned in Section 2.2.2, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged if their distance is under a certain threshold. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will, in many cases, be directly influenced by its length.
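The following sketch approximates the behaviour just described (JustAsk's own implementation differs): it clusters candidate answers with single-link clustering under a distance threshold, using the overlap distance of equation (9); the Levenshtein function is included as the alternative metric that can be plugged in instead.

    def overlap_distance(a, b):
        """Equation (9): token overlap distance between two non-empty answers."""
        x, y = set(a.lower().split()), set(b.lower().split())
        return 1.0 - len(x & y) / min(len(x), len(y))

    def levenshtein(a, b):
        """Classic edit distance between two strings (could be normalized and
        passed as 'metric' below)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    def single_link_clusters(answers, metric=overlap_distance, threshold=0.33):
        """Merge clusters whose minimum pairwise distance is below the threshold."""
        clusters = [[a] for a in answers]
        merged = True
        while merged:
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    dist = min(metric(a, b) for a in clusters[i] for b in clusters[j])
                    if dist < threshold:
                        clusters[i] += clusters.pop(j)   # single-link merge
                        merged = True
                        break
                if merged:
                    break
        return clusters

    print(single_link_clusters(['John F. Kennedy', 'John Kennedy', 'Lee Oswald']))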

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluating QA systems [43], are used:

Precision:

Precision = \frac{\text{Correctly judged items}}{\text{Relevant (non-null) items retrieved}}    (10)

Recall:

Recall = \frac{\text{Relevant (non-null) items retrieved}}{\text{Items in test corpus}}    (11)

F-measure (F1):

F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (12)

Accuracy:

Accuracy = \frac{\text{Correctly judged items}}{\text{Items in test corpus}}    (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}    (14)
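A small sketch of these measures, assuming each system response has been judged as 'correct', 'wrong' or 'none' (no answer returned), and that ranked answer lists and reference sets are available for the MRR computation (all names here are ours):

    def evaluate(judgements):
        """judgements: list with one of 'correct', 'wrong', 'none' per question."""
        total = len(judgements)
        correct = judgements.count('correct')
        answered = total - judgements.count('none')    # relevant (non-null) items retrieved
        precision = correct / answered if answered else 0.0        # equation (10)
        recall = answered / total                                   # equation (11)
        f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
        accuracy = correct / total                                  # equation (13)
        return precision, recall, f1, accuracy

    def mean_reciprocal_rank(ranked_lists, references):
        """Equation (14): ranked_lists[i] is the ordered answer list for question i,
        references[i] the set of correct answers for that question."""
        total = 0.0
        for answers, refs in zip(ranked_lists, references):
            for rank, answer in enumerate(answers, 1):
                if answer in refs:
                    total += 1.0 / rank
                    break
        return total / len(ranked_lists)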

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and JustAsk in particular, we evaluate it by giving it questions as input, and then seeing if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64×200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions, one by one, and for each one of them presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title' and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we took before writing the application. In this version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,


while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.
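A condensed sketch of the collection loop just described (the real application's file formats and attribute handling differ; this is only illustrative, assuming snippets are stored as a JSON map from question to a list of 'Content' strings):

    import json

    def collect_references(questions_file, snippets_file, output_file):
        """Show each snippet for each question and read references from stdin."""
        questions = [q.strip() for q in open(questions_file, encoding='utf-8')]
        snippets = json.load(open(snippets_file, encoding='utf-8'))
        references = {}
        with open(output_file, 'w', encoding='utf-8') as out:
            for question in questions:
                refs = []
                for content in snippets.get(question, []):
                    print(question)
                    print(content)                           # only the 'Content' attribute is shown
                    answer = input('reference (empty to skip): ').strip()
                    if answer and answer in content:         # accept only substrings of the snippet
                        refs.append(answer)
                references[question] = refs
                json.dump(references, out)                   # dump the status after each question
                out.write('\n')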


3.2.3 TREC corpus

In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered out the questions we knew the New York Times had a correct answer for; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step) and therefore lose precious entity information, by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us question how much we can improve something that has no clear references, because the question itself is referencing the previous one, and so forth.

The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? is it the 21st? the 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and by noticing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced it to 1210 questions.

New York Times corpus

The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph or sentence, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions with a certain amount of extracted answers.

                   Google – 8 passages             Google – 32 passages
Extracted          Questions   CAE       AS        Questions   CAE       AS
Answers            (#)         success   success   (#)         success   success
0                  21          0         –         19          0         –
1 to 10            73          62        53        16          11        11
11 to 20           32          29        22        13          9         7
21 to 40           11          10        8         61          57        39
41 to 60           1           1         1         30          25        17
61 to 100          0           –         –         18          14        11
> 100              0           –         –         5           4         3
All                138         102       84        162         120       88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no


                   Google – 64 passages
Extracted          Questions   CAE       AS
Answers            (#)         success   success
0                  17          0         0
1 to 10            18          9         9
11 to 20           2           2         2
21 to 40           16          12        8
41 to 60           42          38        25
61 to 100          38          32        18
> 100              35          30        17
All                168         123       79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.

            Google – 8 passages    Google – 32 passages   Google – 64 passages
      Total  Q     CAE   AS         Q     CAE   AS          Q     CAE   AS
abb   4      3     1     1          3     1     1           4     1     1
des   9      6     4     4          6     4     4           6     4     4
ent   21     15    1     1          16    1     0           17    1     0
hum   64     46    42    34         55    48    36          56    48    31
loc   39     27    24    19         34    27    22          35    29    21
num   63     41    30    25         48    35    25          50    40    22
All   200    138   102   84         162   120   88          168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and


was therefore used permanently while searching for even better results. These 1210 questions proved at least to have higher accuracy and recall

scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three question sets created using TREC questions and the New York Times corpus.

Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

            20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, it also takes more than three times as long as running with just 20 passages, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'


– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls – Okavango River – Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, within a certain predefined threshold (a small sketch of this idea is given after this list).

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'
'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'
'River Glomma' vs. 'Glomma'
'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'
'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity on a token level, we might be able to cluster these answers: if there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer: LogSim is approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should be the threshold here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes' (a sketch of this token-level heuristic is also given after this list).


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'
'Bavaria' vs. 'Overbayern, Upper Bavaria'
'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – the same entity can have completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases, such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'
'Grozny' vs. 'GROZNY'

Solution: the answer should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such


as Soundex can solve more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to include this kind of acronym.
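Two small sketches illustrate the solutions proposed in points 1 and 2 above (both are our own illustrations; the function and variable names are not part of JustAsk). The first boosts a candidate's score when it occurs within a fixed token window of a query keyword in its supporting passage:

    def proximity_boost(passage, candidate, keywords, window=10, boost=0.5):
        """Extra score if the candidate occurs within 'window' tokens of any keyword."""
        tokens = [t.lower() for t in passage.split()]
        first = candidate.lower().split()[0]
        cand_positions = [i for i, t in enumerate(tokens) if t == first]
        key_positions = [i for i, t in enumerate(tokens) if t in {k.lower() for k in keywords}]
        close = any(abs(c - k) <= window for c in cand_positions for k in key_positions)
        return boost if close else 0.0

    passage = 'The Okavango River rises in the Angolan highlands and flows over 1000 miles'
    print(proximity_boost(passage, '1000 miles', ['Okavango', 'River', 'long']))

The second sketches the token-level matching heuristic for person names: at least one exact token in common, the same number of tokens, and matching initials for the remaining tokens.

    def same_person(name1, name2):
        """Heuristic match between two proper names, as proposed in point 2."""
        t1 = [t.strip('.').lower() for t in name1.split()]
        t2 = [t.strip('.').lower() for t in name2.split()]
        if len(t1) != len(t2):
            return False
        full_match = False
        for a, b in zip(t1, t2):
            if a == b:
                full_match = True                     # at least one exact token in common
            elif not a or not b or a[0] != b[0]:
                return False                          # remaining tokens must agree on initials
        return full_match

    print(same_person('J. F. Kennedy', 'John Fitzgerald Kennedy'))   # True
    print(same_person('George Bush', 'George W. Bush'))              # False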

Over this section we have looked at several points of improvement. Next, we will describe how we solved each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of a set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, valuing relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to


analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have this process automated and we have extended the answering module, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This also addresses point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely, the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes and the other for the answer that is included by the other).

For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.
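To make the scoring scheme concrete, below is a minimal sketch of how such a relation-based boost can be wired. The class and field names (Answer, RulesScorer) are hypothetical stand-ins for the actual JustAsk interfaces; only the weights mirror the first experiment described above.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for the JustAsk answer/relation classes.
enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

class Answer {
    final String text;
    int frequency = 1;                                       // exact repetitions of this answer
    final List<RelationType> relations = new ArrayList<>();  // relations shared with other answers
    double score = 0.0;
    Answer(String text) { this.text = text; }
}

public class RulesScorer {
    // Weights from the first experiment: 1.0 for frequency and equivalence,
    // 0.5 for each relation of inclusion.
    static final double FREQUENCY_WEIGHT = 1.0;
    static final double EQUIVALENCE_WEIGHT = 1.0;
    static final double INCLUSION_WEIGHT = 0.5;

    static void score(List<Answer> candidates) {
        for (Answer a : candidates) {
            double s = a.frequency * FREQUENCY_WEIGHT;   // baseline scorer (frequency)
            for (RelationType r : a.relations) {         // rules scorer (equivalence/inclusion)
                s += (r == RelationType.EQUIVALENCE) ? EQUIVALENCE_WEIGHT : INCLUSION_WEIGHT;
            }
            a.score = s;
        }
    }

    public static void main(String[] args) {
        Answer a1 = new Answer("John F. Kennedy");
        a1.relations.add(RelationType.EQUIVALENCE);
        Answer a2 = new Answer("Kennedy");
        a2.relations.add(RelationType.INCLUDED_BY);
        score(Arrays.asList(a1, a2));
        System.out.println(a1.score + " vs " + a2.score); // 2.0 vs 1.5
    }
}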


Figure 5: Relations' module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created to deal with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
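A minimal sketch of what such a conversion normalizer can look like is shown below; the category labels and the choice of canonical units are assumptions for illustration, not necessarily the exact ones used in JustAsk.

public class UnitNormalizer {
    // Convert a value to a single canonical unit per category, so that
    // answers such as '100 km' and '62.1 miles' can end up in the same cluster.
    static double toCanonical(String category, double value) {
        switch (category) {
            case "NUMERIC:DISTANCE":    return value * 0.621371;         // kilometres -> miles
            case "NUMERIC:SIZE":        return value * 3.28084;          // meters -> feet
            case "NUMERIC:WEIGHT":      return value * 2.20462;          // kilograms -> pounds
            case "NUMERIC:TEMPERATURE": return value * 9.0 / 5.0 + 32.0; // Celsius -> Fahrenheit
            default:                    return value;                    // nothing to convert
        }
    }

    public static void main(String[] args) {
        System.out.println(toCanonical("NUMERIC:DISTANCE", 100.0));   // ~62.14
        System.out.println(toCanonical("NUMERIC:TEMPERATURE", 37.0)); // 98.6
    }
}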

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' – in this case, however, we can easily limit King removals only to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a useful reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, i.e., to treat 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, leaving room for further extensions.
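The idea behind this normalizer can be sketched as a simple lookup restricted to the relevant categories; the mapping and category labels below are illustrative only.

import java.util.HashMap;
import java.util.Map;

public class AcronymNormalizer {
    static final Map<String, String> EXPANSIONS = new HashMap<>();
    static {
        // Illustrative entries; the real normalizer may hold a larger table.
        EXPANSIONS.put("US", "United States");
        EXPANSIONS.put("U.S.", "United States");
        EXPANSIONS.put("UN", "United Nations");
    }

    static String normalize(String answer, String category) {
        // Only applied to categories where acronyms for countries/groups are expected.
        if (!category.equals("LOCATION:COUNTRY") && !category.equals("HUMAN:GROUP")) {
            return answer;
        }
        return EXPANSIONS.getOrDefault(answer.trim(), answer);
    }

    public static void main(String[] args) {
        System.out.println(normalize("US", "LOCATION:COUNTRY")); // United States
        System.out.println(normalize("NATO", "HUMAN:GROUP"));    // unchanged
    }
}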

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that appear closer to the query keywords present in the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points).
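A rough sketch of this proximity boost follows, under the assumption that keyword and answer positions are character offsets within the passage; the constants mirror the ones mentioned above, everything else is illustrative.

public class ProximityBooster {
    static final int DISTANCE_THRESHOLD = 20; // characters
    static final double BOOST = 5.0;          // points added to the answer's score

    // Returns the boost to add to an answer extracted at 'answerOffset' in 'passage'.
    static double proximityBoost(String passage, int answerOffset, String[] keywords) {
        double sum = 0.0;
        int found = 0;
        for (String k : keywords) {
            int idx = passage.indexOf(k);
            if (idx >= 0) {
                sum += Math.abs(idx - answerOffset);
                found++;
            }
        }
        if (found == 0) return 0.0;
        double averageDistance = sum / found;
        return (averageDistance < DISTANCE_THRESHOLD) ? BOOST : 0.0;
    }

    public static void main(String[] args) {
        String passage = "The president of South Korea is Lee Myung-bak.";
        int answerOffset = passage.indexOf("Lee Myung-bak");
        System.out.println(proximityBoost(passage, answerOffset,
                new String[] { "South Korea" })); // 5.0 (average distance 15 < 20)
    }
}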


4.3 Results

                      200 questions              1210 questions
Before features       P 50.6  R 87.0  A 44.0     P 28.0  R 79.5  A 22.2
After features        P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear and, in that case, a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
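A rough sketch of how such a pairwise annotation loop can be put together is given below; the input/output formats (tab-separated references, the exact label strings) are simplifying assumptions and not necessarily those of the real application.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;
import java.util.Scanner;

public class RelationAnnotator {
    public static void main(String[] args) throws IOException {
        // Assumed input format: questionId<TAB>reference1<TAB>reference2<TAB>...
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        PrintWriter out = new PrintWriter(new FileWriter(args[1]));
        Scanner console = new Scanner(System.in);
        Map<String, String> labels = Map.of(
                "E", "EQUIVALENCE", "I1", "INCLUSION (1 includes 2)", "I2", "INCLUSION (2 includes 1)",
                "A", "COMPLEMENTARY ANSWERS", "C", "CONTRADICTORY ANSWERS", "D", "IN DOUBT");
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\t");
            // Questions with a single reference have no pairs to judge.
            for (int i = 1; i < fields.length; i++) {
                for (int j = i + 1; j < fields.length; j++) {
                    String judgment;
                    do { // repeat until a valid option (or 'skip') is typed
                        System.out.printf("Q%s: '%s' vs '%s' [E/I1/I2/A/C/D/skip]? ",
                                fields[0], fields[i], fields[j]);
                        judgment = console.nextLine().trim();
                    } while (!judgment.equals("skip") && !labels.containsKey(judgment));
                    if (judgment.equals("skip")) continue;
                    out.printf("QUESTION %s: %s - %s AND %s%n",
                            fields[0], labels.get(judgment), fields[i], fields[j]);
                }
            }
        }
        out.close();
        in.close();
    }
}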

Figure 6: Execution of the application to create relations' corpora in the command line.

In the next chapter we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with the potential to improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics into the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics'.

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):

J(A,B) = \frac{|A \cap B|}{|A \cup B|} \quad (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across the two sequences.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A,B) = \frac{2|A \cap B|}{|A| + |B|} \quad (16)

Dice's Coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets has five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven distinct bigrams (a short code sketch after this list reproduces this computation).


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s_1, s_2) = \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) \quad (17)

in which s_1 and s_2 are the two sequences to be tested, |s_1| and |s_2| are their lengths, m is the number of matching characters in s_1 and s_2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

d_w = d_j + l \cdot p \cdot (1 - d_j) \quad (18)

with d_j being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 for Levenshtein):

D(i, j) = \min \begin{cases} D(i-1, j-1) + G \\ D(i-1, j) + G \\ D(i, j-1) + G \end{cases} \quad (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximizes the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
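As a concrete illustration of the bigram-based computation mentioned in the Dice example above, the following self-contained sketch reproduces the 'Kennedy' / 'Keneddy' numbers (it is a didactic implementation, not the SimMetrics one).

import java.util.HashSet;
import java.util.Set;

public class BigramSimilarity {
    // Set of character bigrams of a string, e.g. "Kennedy" -> {Ke, en, nn, ne, ed, dy}.
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    static double jaccard(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> intersection = new HashSet<>(x);
        intersection.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return (double) intersection.size() / union.size();
    }

    static double dice(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> intersection = new HashSet<>(x);
        intersection.retainAll(y);
        return 2.0 * intersection.size() / (x.size() + y.size());
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));    // 10/12 = 0.833...
        System.out.println(jaccard("Kennedy", "Keneddy")); // 5/7  = 0.714...
    }
}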


Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 2.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
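For reference, the single-link step with a distance threshold can be sketched as follows; the method names and the toy distance in the example are illustrative and do not reproduce the actual JustAsk clusterer.

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class SingleLinkClusterer {
    // Merge clusters while the smallest pairwise distance between any two of
    // them is below the threshold (0.17, 0.33 or 0.5 in the experiments above).
    static List<List<String>> cluster(List<String> answers,
                                      BiFunction<String, String, Double> distance,
                                      double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String a : answers) {
            List<String> c = new ArrayList<>();
            c.add(a);
            clusters.add(c);
        }
        boolean merged = true;
        while (merged) {
            merged = false;
            outer:
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    if (minDistance(clusters.get(i), clusters.get(j), distance) < threshold) {
                        clusters.get(i).addAll(clusters.remove(j)); // single-link merge
                        merged = true;
                        break outer;
                    }
                }
            }
        }
        return clusters;
    }

    static double minDistance(List<String> c1, List<String> c2,
                              BiFunction<String, String, Double> distance) {
        double min = Double.MAX_VALUE;
        for (String a : c1)
            for (String b : c2)
                min = Math.min(min, distance.apply(a, b));
        return min;
    }

    public static void main(String[] args) {
        // Toy normalized distance: 0 if equal ignoring case, 1 otherwise.
        BiFunction<String, String, Double> d = (a, b) -> a.equalsIgnoreCase(b) ? 0.0 : 1.0;
        List<String> answers = List.of("Grozny", "GROZNY", "Moscow");
        System.out.println(cluster(answers, d, 0.17)); // [[Grozny, GROZNY], [Moscow]]
    }
}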

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we would expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold    0.17                    0.33                    0.5
Levenshtein           P 51.1 R 87.0 A 44.5    P 43.7 R 87.5 A 38.5    P 37.3 R 84.5 A 31.5
Overlap               P 37.9 R 87.0 A 33.0    P 33.0 R 87.0 A 33.0    P 36.4 R 86.5 A 31.5
Jaccard index         P  9.8 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5    P  9.9 R 55.5 A  5.5
Dice Coefficient      P  9.8 R 56.0 A  5.5    P  9.9 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5
Jaro                  P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Jaro-Winkler          P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Needleman-Wunsch      P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 16.9 R 59.0 A 10.0
LogSim                P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 42.0 R 87.0 A 36.5
Soundex               P 35.9 R 83.5 A 30.0    P 35.9 R 83.5 A 30.0    P 28.7 R 82.0 A 23.5
Smith-Waterman        P 17.3 R 63.5 A 11.0    P 13.4 R 59.5 A  8.0    P 10.4 R 57.5 A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but additionally adds an extra weight when, besides at least one perfect match, a string has another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two unmatched tokens in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore \quad (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))} \quad (21)

and

firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2} \quad (22)

with a_1 and a_2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
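The computation of equations (20)-(22) can be sketched as below; the whitespace tokenization and the way first-letter ties are resolved are assumptions, since the text only fixes the formulas.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class JFKDistance {
    static double distance(String answer1, String answer2) {
        List<String> t1 = new ArrayList<>(Arrays.asList(answer1.split("\\s+")));
        List<String> t2 = new ArrayList<>(Arrays.asList(answer2.split("\\s+")));
        int maxLength = Math.max(t1.size(), t2.size());

        // Count and remove perfectly matching tokens (the 'common terms').
        int commonTerms = 0;
        for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
            if (t2.remove(it.next())) {
                it.remove();
                commonTerms++;
            }
        }

        // Among the remaining tokens, count pairs sharing the same first letter.
        int firstLetterMatches = 0;
        for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
            String token = it.next();
            for (Iterator<String> jt = t2.iterator(); jt.hasNext(); ) {
                if (token.charAt(0) == jt.next().charAt(0)) {
                    it.remove();
                    jt.remove();
                    firstLetterMatches++;
                    break;
                }
            }
        }

        // nonCommonTerms = every token left without a perfect match,
        // including the ones paired up by their first letter.
        int nonCommonTerms = t1.size() + t2.size() + 2 * firstLetterMatches;
        double commonTermsScore = (double) commonTerms / maxLength;
        double firstLetterMatchesScore = (nonCommonTerms == 0)
                ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    public static void main(String[] args) {
        // One common term (Kennedy), two first-letter matches out of four
        // non-common tokens: 1 - 1/3 - (2/4)/2 = 0.4166...
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}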

A few flaws of this measure were identified later on, namely regarding its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                      200 questions              1210 questions
Before                P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5
After new metric      P 51.7  R 87.0  A 45.0     P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)
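A hypothetical sketch of that dispatch is shown below; the Levenshtein implementation is the standard one, while the second metric is a trivial placeholder standing in for whichever measure (LogSim, Jaro-Winkler, ...) wins for a given category.

import java.util.function.BiFunction;

public class MetricSelector {
    // Standard Levenshtein edit distance.
    static double levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Placeholder for any other metric chosen for a category.
    static double otherMetric(String a, String b) {
        return a.equalsIgnoreCase(b) ? 0.0 : 1.0;
    }

    // The 'decision tree': one metric per question category.
    static BiFunction<String, String, Double> forCategory(String category) {
        switch (category) {
            case "HUMAN":    return MetricSelector::levenshtein;
            case "LOCATION": return MetricSelector::otherMetric;
            default:         return MetricSelector::levenshtein;
        }
    }

    public static void main(String[] args) {
        System.out.println(forCategory("HUMAN").apply("Kennedy", "Keneddy")); // 2.0
    }
}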

For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for that much), Table 10 shows the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).

Metric / Category     Human    Location    Numeric
Levenshtein           50.00    53.85       4.84
Overlap               42.42    41.03       4.84
Jaccard index          7.58     0.00       0.00
Dice                   7.58     0.00       0.00
Jaro                   9.09     0.00       0.00
Jaro-Winkler           9.09     0.00       0.00
Needleman-Wunsch      42.42    46.15       4.84
LogSim                42.42    46.15       4.84
Soundex               42.42    46.15       3.23
Smith-Waterman         9.09     2.56       0.00
JFKDistance           54.55    53.85       4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since we find that the newly implemented distance equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).

– Numeric answers are error-prone, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in formats different from the expected one – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After filling in such gaps, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1% using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we should try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias, and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989


[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006

[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.


[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence

[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006

[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995

[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006

[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970

[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro. Final report in question-answer systems, 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1,261,167 (1918), 1,435,663 (1922)], 1918.

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur, and A. Mendes. Just ask – A multi-pronged approach to question answering, 2010. Submitted to Artificial Intelligence Tools.


[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the jth token of candidate answer Ai;

• Ai,j,k – the kth char of the jth token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is a hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001



1.2 Questions of category Numeric

Given

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "()bnumber( )") and matchesic(Aj, "()bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bmorethan number( )") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()blessthan number( )") and (numAi > numAj)
  – Example: less than 300 includes 200
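To make the Numeric rules above more concrete, here is a small illustrative sketch of two of them (the max/min ±10% equivalence window and the 'more than' / 'less than' inclusion checks); the regular expressions are simplified with respect to the ones in this appendix.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericRules {
    static final double MAX = 1.1, MIN = 0.9;
    static final Pattern NUMBER = Pattern.compile("(\\d+(\\.\\d+)?)");

    static Double numericValue(String answer) {
        Matcher m = NUMBER.matcher(answer);
        return m.find() ? Double.parseDouble(m.group(1)) : null;
    }

    // '210 people' is equivalent to '200 people' (within the +-10% window).
    static boolean equivalent(String ai, String aj) {
        Double ni = numericValue(ai), nj = numericValue(aj);
        return ni != null && nj != null && ni * MAX > nj && ni * MIN < nj;
    }

    // 'more than 200' includes '300'; 'less than 300' includes '200'.
    static boolean includes(String ai, String aj) {
        Double ni = numericValue(ai), nj = numericValue(aj);
        if (ni == null || nj == null) return false;
        if (ai.matches("(?i).*\\b(more than|over|at least)\\b.*")) return ni < nj;
        if (ai.matches("(?i).*\\b(less than|almost|up to|nearly)\\b.*")) return ni > nj;
        return false;
    }

    public static void main(String[] args) {
        System.out.println(equivalent("210 people", "200 people")); // true
        System.out.println(includes("more than 200", "300"));       // true
        System.out.println(includes("less than 300", "200"));       // true
    }
}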

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to F Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey



1 Introduction

QA Systems have been with us since the second half of the 20th century Butit was with the world wide web explosion that QA systems truly broke outleaving restricted domains for global worldwide ones

With the exponential growth of information finding specific informationquickly became an issue Search engines were and still are a valid option butthe amount of the noisy information they generate may not be worth the shotndash after all most of the times we just want a specific single lsquocorrectrsquo answerinstead of a set of documents deemed relevant by the engine we are using Thetask of finding the most lsquocorrectrsquo answer for a given question should be the mainpurpose of a QA system after all

This led us to the development of JustAsk [39 38] a QA system with themission to retrieve (as any other QA system) answers from the web to a givenquestion JustAsk combines hand-crafted rules and machine-learning compo-nents and it is composed by the same modules that define a general QA ar-chitecture Question Interpretation Passage Retrieval and Answer ExtractionUnfortunately the Answer Extraction module is not as efficient as we wouldlike it to be

QA systems usually base the process of finding an answer using the principleof redundancy [22] ie they consider the lsquorightrsquo answer as the one that appearsmore often in more documents Since we are working with the web as ourcorpus we are dealing with the risk of several variations of the same answerbeing perceived as different answers when in fact they are only one and shouldbe counted as such To solve this problem we need ways to connect all of thesevariations

In resume the goals of this work are

ndash To perform a state of the art on answersrsquo relations and paraphrase iden-tification taking into special account the approaches that can be moreeasily adopted by a QA system such as JustAsk

ndash improve the Answer Extraction module by findingimplementing a propermethodologytools to correctly identify relations of equivalence and inclu-sion between answers through a relations module of sorts

ndash test new clustering measures

ndash create corpora and software tools to evaluate our work

This document is structured in the following way Section 2 describes specificrelated work in the subject of relations between answers and paraphrase iden-tification Section 3 details our point of departure ndash JustAsk system ndash and thetasks planned to improve its Answer Extraction module along with the baselineresults prior to this work Section 4 describes the main work done in relatinganwers and respective results Section 5 details the work done for the Clustering

12

stage and the results achieved and finally on Section 6 we present the mainconclusions and future work

13

2 Related Work

As previously stated the focus of this work is to find a better formula tocatch relations of equivalence from notational variances to paraphrases Firstwe identify the relations between answers that will be the target of our studyAnd since paraphrases are such a particular type of equivalences (and yet withthe potential to encompass all types of equivalences if we extend the conceptenough) we then talk about the different techniques to detect paraphrases de-scribed in the literature

21 Relating Answers

As we said in Section 1 QA systems usually base the process of recoveringanswers on the principle of redundancy ie every information item is likely tohave been stated multiple times in multiple ways in multiple documents [22]But this strategy only works if it has a methodology of relating answers behindand that is still a major task in QA

Our main goal here is to learn how a pair of answers are related so thatour system can retrieve a more correct final answer based on frequency countand the number and type of relationships between pairs present in the relevantpassages

From the terminology first proposed by Webber and his team [44] and laterdescribed by Moriceau [29] we can define four different types of relationshipsbetween answers

Equivalence ndash if answers are consistent and entail mutually whether nota-tional variations (lsquoJohn F Kennedyrsquo and lsquoJohn Kennedyrsquo or lsquo24 October 1993rsquoand lsquoOctober 23rd 1993rsquo) or paraphrases (from synonyms and rephrases suchas lsquoHe was shot and diedrsquo vs lsquoHe was murderedrsquo to inferences such as lsquoJudyGarland was Liza Minellirsquos motherrsquo and lsquoLiza Minelli was Judy Garlandrsquos sisterrsquo)

Inclusion ndash if answers are consistent and differ in specificity one entailingthe other They can be of hypernymy (for the question lsquoWhat kind of animal isSimbarsquo can be answered with lsquolionrsquo or lsquomammalrsquo) or meronymy (for the ques-tion lsquoWhere is the famous La Scala opera housersquo lsquoMilanrsquo lsquoItalyrsquo and lsquoEuropersquoare all possible answers)

Alternative ndash if answers are not entailing at all they constitute valid com-plementary answers whether from the same entity (lsquomolten rockrsquo and lsquopredom-inantly 8 elements (Si O Al Mg Na K Ca Fe)rsquo from the question lsquoWhatdoes magma consist ofrsquo) or representing different entities (all the answers tothe question lsquoName a German philosopherrsquo)

Contradictory ndash if answers are inconsistent and their conjunction is invalidFor instance the answers lsquo1430kmrsquo and lsquo1609 kmrsquo for the question lsquoHow longis the Okavango Riverrsquo are contradictory

The two measures that will be mentioned in Section 31 to cluster answersare still insufficient for a good clustering at JustAsk For some answers like

14

lsquoGeorge Bushrsquo and lsquoGeorge W Bushrsquo we may need semantic information sincewe may be dealing with two different individuals

22 Paraphrase Identification

According to WordReferencecom 1 a paraphrase can be defined as

ndash (noun) rewording for the purpose of clarification

ndash (verb) express the same message in different words

There are other multiple definitions (other paraphrases of the definition ofthe word lsquoparaphrasersquo) In the context of our work we define paraphrases asexpressing the same idea in two different text fragments in a symmetric relation-ship in which one entails the other ndash meaning potential addition of informationwith only one fragment entailing the other should be discarded as such

Here is an example of two sentences which should be deemed as paraphrasestaken from the Microsoft Research Paraphrase Corpus 2 from now on mentionedas the MSR Corpus

Sentence 1 Amrozi accused his brother whom he called lsquothe wit-nessrsquo of deliberately distorting his evidenceSentence 2 Referring to him as only lsquothe witnessrsquo Amrozi accusedhis brother of deliberately distorting his evidence

In the following pages we will present an overview of some of approachesand techniques to paraphrase identification related to natural language speciallyafter the release of the MSR Corpus

221 SimFinder ndash A methodology based on linguistic analysis ofcomparable corpora

We can view two comparable texts as collections of several potential para-phrases SimFinder [9] was a tool made for summarization from a team led byVasileios Hatzivassiloglou that allowed to cluster highly related textual unitsIn 1999 they defined a more restricted concept of similarity in order to re-duce ambiguities two text units are similar if lsquothey share the same focus ona common concept actor object or actionrsquo In addition to that the commonobjectactor must either be the subject of the same description or must perform(or be subjected to) the same action The linguistic features implemented canbe grouped into two types

1httpwwwwordreferencecomdefinitionparaphrase2The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed

of 5800 pairs of sentences which have been extracted from news sources over the web withmanual (human) annotations indicating whether a pair is a paraphrase or not It is used in aconsiderable amount of work in Paraphrase Identification

15

ndash Primitive Features ndash encompassing several ways to define a match suchas word co-occurrence (sharing a single word in common) matching nounphrases shared proper nouns WordNet [28] synonyms and matchingverbs with the same semantic class These matches not only consideridentical words but also words that match on their stem ndash the most basicform of a word from which affixes are attached

ndash Composite Features ndash combination of pairs of primitive features to im-prove clustering Analyzing if two joint primitives occur in the sameorder within a certain distance or matching joint primitives based onthe primitivesrsquo type ndash for instance matching a noun phrase and a verbtogether

The system showed better results than TD-IDF [37] the classic informationretrieval method already in its first version Some improvements were madelater on with a larger team that included Regina Barzilay a very prominentresearcher in the field of Paraphrase Identification namely the introduction ofa quantitative evaluation measure

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many N-grams overlap in two sentences (usually with N ≤ 4) – in other words, how many common sequences of words we can find between the two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:

NGsim(S_1, S_2) = \frac{1}{N} \sum_{n=1}^{N} \frac{Count_{match}(n\text{-}gram)}{Count(n\text{-}gram)} \quad (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.

A year later, Dolan and his team [7] used the Levenshtein Distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edit operations needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function that measures how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.
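Since the Levenshtein distance is used repeatedly throughout this work (it is one of the two metrics currently available in JustAsk), the standard dynamic programming formulation is sketched below for reference; this is the textbook algorithm, not the actual JustAsk code.

public class Levenshtein {

    // Minimum number of insertions, deletions and substitutions
    // needed to transform one string into the other.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,     // deletion
                                            d[i][j - 1] + 1),    // insertion
                                   d[i - 1][j - 1] + cost);      // substitution
            }
        }
        return d[a.length()][b.length()];
    }
}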


In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. Ironically, the Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are the two metrics that the answer extraction module of JustAsk currently possesses, so this is a good starting point to explore other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

L(x, \lambda) = -\log_2(\lambda / x) \quad (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are way too similar). The complete formula is written below:

logSim(S_1, S_2) = \begin{cases} L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\ e^{-k \ast L(x, \lambda)} & \text{otherwise} \end{cases} \quad (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e., the number of words of the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.

L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases) added to compensate for the lack of negative pairs in those two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1,087 similar sentence pairs, with one sentence being a compressed version of the other.
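For illustration, a minimal sketch of the basic (two-parameter) LogSim measure follows, counting lexical links over lowercased tokens; the constant k = 2.7 is the value reported by Cordeiro's team, and all class and method names are ours.

import java.util.*;

public class LogSim {

    private static final double K = 2.7; // penalty constant used by Cordeiro et al.

    public static double logSim(String s1, String s2) {
        Set<String> w1 = new HashSet<>(Arrays.asList(s1.toLowerCase().split("\\s+")));
        Set<String> w2 = new HashSet<>(Arrays.asList(s2.toLowerCase().split("\\s+")));
        Set<String> links = new HashSet<>(w1);
        links.retainAll(w2);                      // lambda: lexical links between S1 and S2
        int x = Math.max(w1.size(), w2.size());   // number of words of the longest sentence
        double lambda = links.size();
        if (lambda == 0) {
            return 0.0;                           // no links, no similarity
        }
        double l = -Math.log(lambda / x) / Math.log(2); // L(x, lambda) = -log2(lambda / x)
        return l < 1.0 ? l : Math.exp(-K * l);    // the two branches of Equation (3)
    }
}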


2.2.3 Canonicalization as a complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution, converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other, more 'sophisticated' methodologies, which again should support the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kinds of equivalences involving Numeric entities that can be considered answers for certain types of questions are already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question. Two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S_1, S_2) = cos(S_1, S_2) - mismatch\ penalty \quad (4)


with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming recurrence – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:

s(S_1, S_2) = \max \begin{cases} s(S_1, S_2 - 1) - skip\ penalty \\ s(S_1 - 1, S_2) - skip\ penalty \\ s(S_1 - 1, S_2 - 1) + sim(S_1, S_2) \\ s(S_1 - 1, S_2 - 2) + sim(S_1, S_2) + sim(S_1, S_2 - 1) \\ s(S_1 - 2, S_2 - 1) + sim(S_1, S_2) + sim(S_1 - 1, S_2) \\ s(S_1 - 2, S_2 - 2) + sim(S_1, S_2 - 1) + sim(S_1 - 1, S_2) \end{cases} \quad (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.

Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier, in order to determine whether we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the number of documents including the word. For the corpus, they used the British National Corpus 4, a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} maxSim(w, T_2) \ast idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} maxSim(w, T_1) \ast idf(w)}{\sum_{w \in T_2} idf(w)} \right) \quad (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxSim(w, T2), and then apply the specificity via IDF, summing these weighted matches and normalizing the result by the sum of the IDF values of the words in the segment. The same process is applied to T2 and, in the end, we average the T1-to-T2 and T2-to-T1 directions. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22], and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics:

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates the degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses an algebraic technique called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related to them through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.

– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.

4 http://www.natcorp.ox.ac.uk
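Going back to Equation (6), the sketch below shows how the measure could be computed once a word-to-word similarity function (any of the eight metrics above) and an IDF table are available. Both are passed in as parameters, since their actual implementations depend on external resources such as WordNet and the British National Corpus; all names in the sketch are ours.

import java.util.*;
import java.util.function.BiFunction;

public class TextSimilarity {

    // sim(T1, T2) from Equation (6): a directional score from t1 to t2 and
    // from t2 to t1, averaged, with each word weighted by its IDF.
    public static double sim(List<String> t1, List<String> t2,
                             BiFunction<String, String, Double> wordSim,
                             Map<String, Double> idf) {
        return 0.5 * (directional(t1, t2, wordSim, idf)
                    + directional(t2, t1, wordSim, idf));
    }

    private static double directional(List<String> from, List<String> to,
                                      BiFunction<String, String, Double> wordSim,
                                      Map<String, Double> idf) {
        double num = 0.0, den = 0.0;
        for (String w : from) {
            double maxSim = 0.0;                       // maxSim(w, T2)
            for (String v : to) {
                maxSim = Math.max(maxSim, wordSim.apply(w, v));
            }
            double weight = idf.getOrDefault(w, 1.0);  // specificity of w
            num += maxSim * weight;
            den += weight;
        }
        return den == 0.0 ? 0.0 : num / den;
    }
}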

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domains resource.

– Cosine measure – by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning, and therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors \vec{a} and \vec{b}, and the similarity is calculated using the following formula:

sim(\vec{a}, \vec{b}) = \frac{\vec{a} \, W \, \vec{b}}{|\vec{a}| \, |\vec{b}|} \quad (7)


with W being a semantic similarity matrix containing information about the similarity level of each word pair – in other words, a number between 0 and 1 for each W_{ij}, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's approaches, and five also with a better precision rating than those three.
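A small sketch of Equation (7) follows, assuming the two binary word vectors and the semantic similarity matrix W (built with one of the WordNet metrics over a fixed vocabulary) have already been constructed; the code only illustrates the computation itself.

public class MatrixSimilarity {

    // sim(a, b) = (a . W . b) / (|a| * |b|), with a and b binary word vectors
    // over a fixed vocabulary and W a word-to-word similarity matrix.
    public static double sim(double[] a, double[] b, double[][] w) {
        double num = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == 0.0) continue;
            for (int j = 0; j < b.length; j++) {
                num += a[i] * w[i][j] * b[j];
            }
        }
        return num / (norm(a) * norm(b));
    }

    private static double norm(double[] v) {
        double sum = 0.0;
        for (double x : v) sum += x * x;
        return Math.sqrt(sum);
    }
}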

More recently, Lintean and Rus [23] identified paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S_1, S_2) = \frac{Sim(S_1, S_2)}{Diss(S_1, S_2)} \quad (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the paraphrase identification formula.

For all of these techniques we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1,087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two, with negative pairs added in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID 5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about the question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

5 http://www.inesc-id.pt

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question, and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of that article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (being able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as

Overlap(X, Y) = 1 - \frac{|X \cap Y|}{\min(|X|, |Y|)} \quad (9)

where |X \cap Y| is the number of tokens shared by candidate answers X and Y, and \min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned before, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any member of one cluster and any member of the other. Initially, every candidate answer has a cluster of its own, and at each step the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer in it. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its size.
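To make the previous paragraph more concrete, here is a simplified sketch of the overlap distance of Equation (9) and of the single-link merging criterion; it ignores cluster scores and representatives, and the names used are ours rather than the actual JustAsk classes.

import java.util.*;

public class OverlapDistance {

    // Overlap(X, Y) = 1 - |X intersect Y| / min(|X|, |Y|): returns 0.0 when one
    // answer is equal to, or contained in, the other.
    public static double overlap(String x, String y) {
        Set<String> tx = new HashSet<>(Arrays.asList(x.toLowerCase().split("\\s+")));
        Set<String> ty = new HashSet<>(Arrays.asList(y.toLowerCase().split("\\s+")));
        Set<String> common = new HashSet<>(tx);
        common.retainAll(ty);
        return 1.0 - (double) common.size() / Math.min(tx.size(), ty.size());
    }

    // Single-link criterion: two clusters are merged if the minimum distance
    // between any pair of their members is below the threshold.
    public static boolean shouldMerge(List<String> c1, List<String> c2, double threshold) {
        double min = Double.MAX_VALUE;
        for (String a : c1) {
            for (String b : c2) {
                min = Math.min(min, overlap(a, b));
            }
        }
        return min < threshold;
    }
}

Under the single-link criterion, a single sufficiently close pair of members is enough to merge two clusters.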

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = \frac{Correctly\ judged\ items}{Relevant\ (non\text{-}null)\ items\ retrieved} \quad (10)

Recall:

Recall = \frac{Relevant\ (non\text{-}null)\ items\ retrieved}{Items\ in\ test\ corpus} \quad (11)

F-measure (F1):

F = \frac{2 \ast precision \ast recall}{precision + recall} \quad (12)

Accuracy:

Accuracy = \frac{Correctly\ judged\ items}{Items\ in\ test\ corpus} \quad (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by

MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i} \quad (14)
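A rough sketch of how these measures can be computed from the counts collected in a test run follows; 'correct', 'answered' and 'total' are, respectively, the correctly judged items, the non-null answers retrieved and the number of questions in the test corpus, and 'ranks' holds the rank of the first correct answer for each question (0 when none was returned).

public class Evaluation {

    public static double precision(int correct, int answered) {
        return (double) correct / answered;                   // Equation (10)
    }

    public static double recall(int answered, int total) {
        return (double) answered / total;                     // Equation (11)
    }

    public static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall); // Equation (12)
    }

    public static double accuracy(int correct, int total) {
        return (double) correct / total;                      // Equation (13)
    }

    // ranks[i] is the rank of the first correct answer for question i, or 0
    // if no correct answer was returned for that question.
    public static double mrr(int[] ranks) {
        double sum = 0.0;
        for (int r : ranks) {
            if (r > 0) sum += 1.0 / r;
        }
        return sum / ranks.length;                            // Equation (14)
    }
}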

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input, and then we see if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more precise), since they no longer made sense (like asking who the vice-president of a company that no longer exists is, or how old a person that has already died is). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo) – passages which we call 'snippets'. This process is quite exhausting in itself: the corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64x200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1. Question=What is the capital of Somalia
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the file containing the corpus, with a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each one of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, by the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.


3.2.3 TREC corpus

In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1,768 questions, which we filtered in order to extract the questions for which we knew the New York Times had a correct answer. 1,134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1,768 questions and the file containing all the correct judgements as input parameters, in order to filter only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1,134 questions, we noticed that many of these questions are actually interconnected (some even losing their original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' clearly remain unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: 'How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?'

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and therefore was composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and by fixing a few mistakes in the application that filters the questions that had a correct answer from the New York Times, we reached a total of 1,309 questions. However, due to a length barrier at JustAsk, we reduced this set to 1,210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to select a correct final answer from the list of candidate answers. The tables also present a distribution of the number of questions with a certain amount of extracted answers.

                     Google – 8 passages              Google – 32 passages
Extracted Answers    Questions   CAE       AS         Questions   CAE       AS
                     (#)         success   success    (#)         success   success
0                    21          0         –          19          0         –
1 to 10              73          62        53         16          11        11
11 to 20             32          29        22         13          9         7
21 to 40             11          10        8          61          57        39
41 to 60             1           1         1          30          25        17
61 to 100            0           –         –          18          14        11
> 100                0           –         –          5           4         3
All                  138         102       84         162         120       88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

                     Google – 64 passages
Extracted Answers    Questions (#)   CAE success   AS success
0                    17              0             0
1 to 10              18              9             9
11 to 20             2               2             2
21 to 40             16              12            8
41 to 60             42              38            25
61 to 100            38              32            18
> 100                35              30            17
All                  168             123           79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

        Google – 8 passages    Google – 32 passages    Google – 64 passages
      Total   Q   CAE   AS       Q   CAE   AS             Q   CAE   AS
abb     4     3    1     1       3    1     1             4    1     1
des     9     6    4     4       6    4     4             6    4     4
ent    21    15    1     1      16    1     0            17    1     0
hum    64    46   42    34      55   48    36            56   48    31
loc    39    27   24    19      34   27    22            35   29    21
num    63    41   30    25      48   35    25            50   40    22
All   200   138  102    84     162  120    88           168  123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1,134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

The 1,210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three question sets that were created using TREC questions and the New York Times corpus.

Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

            20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though the 100 passages offer the best results, it also takes more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still presents over its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since each of these answers appears only once, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

– 'Namibia has built a water canal measuring about 300 km long... Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search than a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that computes similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59). With the newly proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should be the threshold here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'. (A sketch of this token-level heuristic is given right after this list.)


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Oberbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers correct, with 'Ferrara' included in Italy (and being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: 'Where do we draw the limit for inclusive answers? Which answer should we consider as the representative if we group these answers in a single cluster?'

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issue – many entities are referred to by completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be to normalize all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to also cover acronyms.
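To make the token-level heuristic proposed in point 2 more concrete, a minimal sketch follows; it is only an illustration of the idea, with hypothetical names, and not code that is part of JustAsk at this stage.

import java.util.*;

public class NameMatcher {

    // Heuristic sketch: 'J. F. Kennedy' and 'John Fitzgerald Kennedy' share the
    // token 'kennedy' and the initials of the remaining tokens match in order.
    public static boolean sameEntity(String a, String b) {
        String[] ta = a.toLowerCase().replace(".", "").split("\\s+");
        String[] tb = b.toLowerCase().replace(".", "").split("\\s+");
        boolean sharedToken = false;
        for (String x : ta) {
            for (String y : tb) {
                if (x.equals(y)) sharedToken = true;
            }
        }
        if (!sharedToken || ta.length != tb.length) {
            return false;
        }
        // The remaining tokens must at least agree on their initials, in order.
        for (int i = 0; i < ta.length; i++) {
            if (ta[i].charAt(0) != tb[i].charAt(0)) {
                return false;
            }
        }
        return true;
    }
}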

Over this section we have looked at several points of improvement. Next, we will describe how we solved all of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want our system to do in order to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and, especially, contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have made this process automatic and have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes, and another for the answer that is included by, the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and to give a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
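A simplified sketch of how the Rules Scorer can boost candidate scores from the detected relations is shown below; the class and field names are illustrative, and the actual JustAsk implementation differs in its details.

import java.util.*;

enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

class Relation {
    RelationType type;
    Relation(RelationType type) { this.type = type; }
}

class CandidateAnswer {
    String text;
    double score;
    List<Relation> relations = new ArrayList<>();
}

public class RulesScorer {

    // Scores used in the first experiment: 1.0 for equivalence, 0.5 for inclusion.
    public void score(List<CandidateAnswer> answers) {
        for (CandidateAnswer answer : answers) {
            for (Relation r : answer.relations) {
                switch (r.type) {
                    case EQUIVALENCE: answer.score += 1.0; break;
                    case INCLUDES:
                    case INCLUDED_BY: answer.score += 0.5; break;
                }
            }
        }
    }
}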


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, which is able to convert kilometres to miles, meters to feet, kilograms to pounds, and Celsius degrees to Fahrenheit, respectively.
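A minimal sketch of such a conversion normalizer follows; the conversion factors are the standard ones, and the class and method names are illustrative.

public class UnitNormalizer {

    // Normalizes a few common measures to a single unit so that equivalent
    // candidate answers can fall into the same cluster.
    public static double kilometresToMiles(double km) {
        return km * 0.621371;
    }

    public static double metresToFeet(double m) {
        return m * 3.28084;
    }

    public static double kilogramsToPounds(double kg) {
        return kg * 2.20462;
    }

    public static double celsiusToFahrenheit(double c) {
        return c * 9.0 / 5.0 + 32.0;
    }
}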

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, the extended corpus of 1,134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also covers only very particular cases for the moment, leaving room for further extensions.

4.2.2 Proximity Measure

As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
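A sketch of this proximity heuristic is given below, using the 20-character threshold and the 5-point boost mentioned above; the passage and keyword handling is heavily simplified, and all names are ours.

import java.util.*;

public class ProximityScorer {

    private static final double DISTANCE_THRESHOLD = 20.0; // characters
    private static final double BOOST = 5.0;               // score points

    // Boosts the answer score when the answer occurs, on average, close to
    // the query keywords within the passage it was extracted from.
    public static double boost(String passage, String answer,
                               List<String> keywords, double score) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0 || keywords.isEmpty()) {
            return score;
        }
        double total = 0.0;
        int found = 0;
        for (String keyword : keywords) {
            int pos = passage.indexOf(keyword);
            if (pos >= 0) {
                total += Math.abs(pos - answerPos);
                found++;
            }
        }
        if (found > 0 && total / found < DISTANCE_THRESHOLD) {
            return score + BOOST;
        }
        return score;
    }
}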


4.3 Results

                  200 questions                     1210 questions
Before features   P: 50.6%   R: 87.0%   A: 44.0%    P: 28.0%   R: 79.5%   A: 22.2%
After features    P: 51.1%   R: 87.0%   A: 44.5%    P: 28.2%   R: 79.8%   A: 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1,210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described over Section 4.1, a corpus for evaluating relations between answers was designed, before dealing with the implementation of the module itself. For that effort, we took the references we gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between pairs

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first)

– 'A' for alternative (complementary) answers

– 'C' for contradictory answers


– 'D' for cases of doubt between relation types

If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADITORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
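A minimal sketch of such an annotation loop is shown below, assuming the references for each question are already loaded; it writes the chosen option letter instead of the full relation name, and every name in it is illustrative rather than the actual application's code.

import java.util.List;
import java.util.Scanner;
import java.util.Set;

/** Minimal sketch of the relation-annotation loop (illustrative, not the actual application). */
public class RelationAnnotator {

    private static final Set<String> OPTIONS = Set.of("E", "I1", "I2", "A", "C", "D");

    public void annotate(String questionId, List<String> references, Scanner in, StringBuilder out) {
        if (references.size() < 2) {
            return; // single reference: nothing to relate, advance to the next question
        }
        for (int i = 0; i < references.size(); i++) {
            for (int j = i + 1; j < references.size(); j++) {
                String judgment;
                do {
                    System.out.printf("%s: '%s' vs '%s' [E/I1/I2/A/C/D/skip]? ",
                            questionId, references.get(i), references.get(j));
                    judgment = in.nextLine().trim();
                } while (!OPTIONS.contains(judgment) && !judgment.equals("skip"));
                if (!judgment.equals("skip")) {
                    // The real tool prints the full relation name (e.g. EQUIVALENCE) instead of the letter.
                    out.append(String.format("QUESTION %s %s - %s AND %s%n",
                            questionId, judgment, references.get(i), references.get(j)));
                }
            }
        }
    }
}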

Figure 6: Execution of the application to create relations' corpora in the command line.

Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with the potential to improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined by the size of the intersection divided by the size of the union of two sets (A and B):

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (15) \]

In other words, the idea will be to divide the number of characters in common by the sum of the total number of characters of the two sequences.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

\[ Dice(A, B) = \frac{2|A \cap B|}{|A| + |B|} \qquad (16) \]

Dice's Coefficient will basically give double the weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams for each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly four elements: {Ke, en, ne, dy}. The result would be (2·4)/(6+6) = 8/12 = 2/3. If we were using Jaccard, we would have 4/12 = 1/3.


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

\[ Jaro(s_1, s_2) = \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) \qquad (17) \]

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

\[ d_w = d_j + l \cdot p \cdot (1 - d_j) \qquad (18) \]

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or misspellings such as 'Martha' vs 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 for Levenshtein):

\[ D(i, j) = \min \begin{cases} D(i-1, j-1) + G \\ D(i-1, j) + G \\ D(i, j-1) + G \end{cases} \qquad (19) \]

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.


Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 2.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much this threshold affects the performance of each metric. A sketch of how a metric plugs into this clustering step is shown below.
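The sketch assumes a generic metric interface and a greedy single-link assignment under a distance threshold; the interface and names are illustrative simplifications, not the actual JustAsk clusterer.

import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of single-link clustering of answers under a distance threshold (illustrative names). */
public class SingleLinkClusterer {

    /** A string distance in [0, 1]; 0 means identical. Implementations could wrap Levenshtein, LogSim, etc. */
    public interface DistanceMetric {
        double distance(String a, String b);
    }

    public List<List<String>> cluster(List<String> answers, DistanceMetric metric, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String answer : answers) {
            List<String> target = null;
            for (List<String> cluster : clusters) {
                // Single link: one member close enough to the new answer is sufficient.
                for (String member : cluster) {
                    if (metric.distance(answer, member) <= threshold) {
                        target = cluster;
                        break;
                    }
                }
                if (target != null) {
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                clusters.add(target);
            }
            target.add(answer);
        }
        return clusters;
    }
}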

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between these two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold    0.17                      0.33                      0.5
Levenshtein           P: 51.1 R: 87.0 A: 44.5   P: 43.7 R: 87.5 A: 38.5   P: 37.3 R: 84.5 A: 31.5
Overlap               P: 37.9 R: 87.0 A: 33.0   P: 33.0 R: 87.0 A: 33.0   P: 36.4 R: 86.5 A: 31.5
Jaccard index         P: 9.8  R: 56.0 A: 5.5    P: 9.9  R: 55.5 A: 5.5    P: 9.9  R: 55.5 A: 5.5
Dice Coefficient      P: 9.8  R: 56.0 A: 5.5    P: 9.9  R: 56.0 A: 5.5    P: 9.9  R: 55.5 A: 5.5
Jaro                  P: 11.0 R: 63.5 A: 7.0    P: 11.9 R: 63.0 A: 7.5    P: 9.8  R: 56.0 A: 5.5
Jaro-Winkler          P: 11.0 R: 63.5 A: 7.0    P: 11.9 R: 63.0 A: 7.5    P: 9.8  R: 56.0 A: 5.5
Needleman-Wunsch      P: 43.7 R: 87.0 A: 38.0   P: 43.7 R: 87.0 A: 38.0   P: 16.9 R: 59.0 A: 10.0
LogSim                P: 43.7 R: 87.0 A: 38.0   P: 43.7 R: 87.0 A: 38.0   P: 42.0 R: 87.0 A: 36.5
Soundex               P: 35.9 R: 83.5 A: 30.0   P: 35.9 R: 83.5 A: 30.0   P: 28.7 R: 82.0 A: 23.5
Smith-Waterman        P: 17.3 R: 63.5 A: 11.0   P: 13.4 R: 59.5 A: 8.0    P: 10.4 R: 57.5 A: 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim, also using a tokenizer to break the strings into sets of tokens, taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adding an extra weight for strings that not only share at least a perfect match but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J' and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed the following way

\[ distance = 1 - commonTermsScore - firstLetterMatchesScore \qquad (20) \]

in which the scores for the common terms and for the matches of the first letter of each token are given by:

\[ commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))} \qquad (21) \]

and

\[ firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2} \qquad (22) \]

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
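A minimal sketch of how such a distance could be computed, following equations (20)–(22), is given below; the tokenization and the handling of unmatched tokens are simplified, and the class name is illustrative.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Minimal sketch of the distance described by equations (20)-(22) (illustrative, simplified). */
public class JFKDistance {

    public double distance(String a1, String a2) {
        List<String> tokens1 = new ArrayList<>(Arrays.asList(a1.trim().split("\\s+")));
        List<String> tokens2 = new ArrayList<>(Arrays.asList(a2.trim().split("\\s+")));
        int maxLength = Math.max(tokens1.size(), tokens2.size());

        // Count and remove perfect matches (common terms).
        int commonTerms = 0;
        for (String token : new ArrayList<>(tokens1)) {
            if (tokens2.remove(token)) {
                tokens1.remove(token);
                commonTerms++;
            }
        }

        // Among the remaining (non-common) tokens, count pairs sharing the same first letter.
        int firstLetterMatches = 0;
        for (String t1 : tokens1) {
            for (String t2 : tokens2) {
                if (!t1.isEmpty() && !t2.isEmpty()
                        && Character.toLowerCase(t1.charAt(0)) == Character.toLowerCase(t2.charAt(0))) {
                    firstLetterMatches++;
                    break; // each token from a1 is matched at most once
                }
            }
        }

        int nonCommonTerms = tokens1.size() + tokens2.size();
        double commonTermsScore = (double) commonTerms / maxLength;                    // eq. (21)
        double firstLetterMatchesScore = nonCommonTerms == 0
                ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;          // eq. (22)
        return 1.0 - commonTermsScore - firstLetterMatchesScore;                       // eq. (20)
    }
}

Under these assumptions, 'J. F. Kennedy' vs 'John Fitzgerald Kennedy' yields a distance of roughly 0.42, lower than what either Overlap or Levenshtein gives for the same pair.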

A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures according to the type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows us the results after the implementation of this new distance metric.

                    200 questions                 1210 questions
Before              P: 51.1  R: 87.0  A: 44.5     P: 28.2  R: 79.8  A: 22.5
After new metric    P: 51.7  R: 87.0  A: 45.0     P: 29.2  R: 79.8  A: 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers us the best results, we would have something like:

if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for too much), here are the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric / Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index        7.58     0.00       0.00
Dice                 7.58     0.00       0.00
Jaro                 9.09     0.00       0.00
Jaro-Winkler         9.09     0.00       0.00
Needleman-Wunsch     42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman       9.09     2.56       0.00
JFKDistance          54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except for when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, i.e., results are always 'proportional' in each category to how they perform overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – Some questions do not have a single straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers easily fail due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– Filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as this, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved

– We performed a state of the art on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

ndash We developed new metrics to improve the clustering of answers

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

ndash We performed further analysis to detect persistent error patterns

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

ndash Use the rules as a distance metric of its own

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help to improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN, pages 1870–4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Languages, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Transactions on Information Systems, 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. US patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the j-th token of candidate answer Ai

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is the hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001

1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX · max

• numXmin = numX · min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, 'Aj( [a-z]+)')
  – Example: 200 people is equivalent to 200

• matchesic(Ai, '(.*)\bnumber( .*)') and matchesic(Aj, '(.*)\bnumber( .*)') and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, '(.*)\bapprox number( .*)')
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, '(.*)\bapprox number( .*)') and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, '(.*)\bapprox number( .*)') and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, '(.*)\bmorethan number( .*)') and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, '(.*)\blessthan number( .*)') and (numAi > numAj)
  – Example: less than 300 includes 200

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, '(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)')
  – Example: Mr Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to F. Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, '(republic|city|kingdom|island)') and matches(Ai,1, '(of|in)')
  – Example: Republic of China is equivalent to China

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, '(of|in)')
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, 'in\s+Aj,0')
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, '(republic|city|kingdom|island)\s+(in|of)\s+Aj,0')
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, '(in|of)\s+Aj,0')
  – Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ',')
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ',')
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, '(south|west|north|east|central)[a-z]')
  – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey
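As an illustration of how rules like these could be encoded, the sketch below implements the first Human equivalence rule ('John F. Kennedy' vs 'John Kennedy'); it is a hypothetical encoding, not the actual JustAsk rule engine.

/** Minimal sketch of one Human equivalence rule (hypothetical encoding, not the actual rule engine). */
public class HumanEquivalenceRule {

    /** True if Ai has 3 tokens, Aj has 2, and the first and last tokens match case-insensitively. */
    public boolean areEquivalent(String ai, String aj) {
        String[] a = ai.trim().split("\\s+");
        String[] b = aj.trim().split("\\s+");
        return a.length == 3 && b.length == 2
                && a[0].equalsIgnoreCase(b[0])
                && a[2].equalsIgnoreCase(b[1]);
    }
}

With these assumptions, areEquivalent("John F. Kennedy", "John Kennedy") returns true.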



Keywords

JustAsk, Question Answering, Relations between Answers, Paraphrase Identification, Clustering, Normalization, Corpus, Evaluation, Answer Extraction


Index

1 Introduction 12

2 Related Work 14
  2.1 Relating Answers 14
  2.2 Paraphrase Identification 15
    2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora 15
    2.2.2 Methodologies purely based on lexical similarity 16
    2.2.3 Canonicalization as complement to lexical features 18
    2.2.4 Use of context to extract similar phrases 18
    2.2.5 Use of dissimilarities and semantics to provide better similarity measures 19
  2.3 Overview 22

3 JustAsk architecture and work points 24
  3.1 JustAsk 24
    3.1.1 Question Interpretation 24
    3.1.2 Passage Retrieval 25
    3.1.3 Answer Extraction 25
  3.2 Experimental Setup 26
    3.2.1 Evaluation measures 26
    3.2.2 Multisix corpus 27
    3.2.3 TREC corpus 30
  3.3 Baseline 31
    3.3.1 Results with Multisix Corpus 31
    3.3.2 Results with TREC Corpus 33
  3.4 Error Analysis over the Answer Module 34

4 Relating Answers 38
  4.1 Improving the Answer Extraction module: Relations Module Implementation 38
  4.2 Additional Features 41
    4.2.1 Normalizers 41
    4.2.2 Proximity Measure 41
  4.3 Results 42
  4.4 Results Analysis 42
  4.5 Building an Evaluation Corpus 42

5 Clustering Answers 44
  5.1 Testing Different Similarity Metrics 44
    5.1.1 Clusterer Metrics 44
    5.1.2 Results 46
    5.1.3 Results Analysis 47
  5.2 New Distance Metric – 'JFKDistance' 48
    5.2.1 Description 48
    5.2.2 Results 50
    5.2.3 Results Analysis 50
  5.3 Possible new features – Combination of distance metrics to enhance overall results 50
  5.4 Posterior analysis 51

6 Conclusions and Future Work 53


List of Figures

1 JustAsk architecture, mirroring the architecture of a general QA system 24
2 How the software to create the 'answers corpus' works 28
3 Execution of the application to create answers corpora in the command line 29
4 Proposed approach to automate the detection of relations between answers 38
5 Relations' module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion 40
6 Execution of the application to create relations' corpora in the command line 43


List of Tables

1 JustAsk baseline best results for the corpus test of 200 questions 32
2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages 32
3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google 33
4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category 33
5 Baseline results for the three sets of corpus that were created using TREC questions and New York Times corpus 34
6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus 34
7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC 42
8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold 47
9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC 50
10 Accuracy results for every similarity metric added for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except for when all the metrics give the same result 51


Acronyms

TREC – Text REtrieval Conference
QA – Question-Answering
POS – Part of Speech


1 Introduction

QA Systems have been with us since the second half of the 20th century. But it was with the World Wide Web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the shot – after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.

This led us to the development of JustAsk [39, 38], a QA system with the mission of retrieving (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed by the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears more often in more documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.

In summary, the goals of this work are:

– To perform a state of the art on answers' relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– To improve the Answer Extraction module by finding/implementing a proper methodology/tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

ndash test new clustering measures

ndash create corpora and software tools to evaluate our work

This document is structured in the following way: Section 2 describes specific related work in the subject of relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the Clustering stage and the results achieved; and finally, in Section 6, we present the main conclusions and future work.


2 Related Work

As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then talk about the different techniques to detect paraphrases described in the literature.

2.1 Relating Answers

As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if it has a methodology of relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency count and the number and type of relationships between pairs present in the relevant passages.

From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').

Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.

The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.

2.2 Paraphrase Identification

According to WordReference.com (1), a paraphrase can be defined as:

ndash (noun) rewording for the purpose of clarification

ndash (verb) express the same message in different words

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.

Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus (2), from now on mentioned as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages, we will present an overview of some approaches and techniques for paraphrase identification related to natural language, specially after the release of the MSR Corpus.

2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora

We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, from a team led by Vasileios Hatzivassiloglou, that allowed clustering of highly related textual units. In 1999, they defined a more restricted concept of similarity, in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:

(1) http://www.wordreference.com/definition/paraphrase
(2) The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.


– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.

The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N ≤ 4) – in other words, how many different sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:

\[ NGsim(S_1, S_2) = \frac{1}{N} \sum_{n=1}^{N} \frac{Count\_match(N\text{-}gram)}{Count(N\text{-}gram)} \qquad (1) \]

in which Count_match(N-gram) counts how many N-grams overlap for a given sentence pair, and Count(N-gram) represents the maximum number of N-grams in the shorter sentence.
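A minimal sketch of this measure is given below, assuming word-level N-grams and taking the denominator over the shorter sentence, as stated above; the names are illustrative.

import java.util.HashSet;
import java.util.Set;

/** Minimal sketch of the Word Simple N-gram Overlap measure (illustrative, word-level N-grams). */
public class NGramOverlap {

    /** Computes NGsim(s1, s2) for N-gram sizes 1..maxN (typically maxN = 4). */
    public double ngramSim(String s1, String s2, int maxN) {
        String[] w1 = s1.toLowerCase().split("\\s+");
        String[] w2 = s2.toLowerCase().split("\\s+");
        double sum = 0.0;
        for (int n = 1; n <= maxN; n++) {
            Set<String> grams1 = ngrams(w1, n);
            Set<String> grams2 = ngrams(w2, n);
            int shorter = Math.min(grams1.size(), grams2.size());
            if (shorter == 0) {
                continue; // one of the sentences is too short for this N
            }
            Set<String> common = new HashSet<>(grams1);
            common.retainAll(grams2);
            sum += (double) common.size() / shorter;
        }
        return sum / maxN;
    }

    private Set<String> ngrams(String[] words, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", java.util.Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }
}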

A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edits needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses. A string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed extracting richer, but also less parallel, data.


In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore some other options. As an alternative to these metrics, a new category called LogSim was proposed by Cordeiro's team. The idea is, for a pair of two input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

\[ L(x, \lambda) = -\log_2(\lambda / x) \qquad (2) \]

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are too similar). The complete formula is written below:

\[ logSim(S_1, S_2) = \begin{cases} L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\ e^{-k \cdot L(x, \lambda)} & \text{otherwise} \end{cases} \qquad (3) \]

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.

L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus (3), and extra negative pairs (of non-paraphrases), to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
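For reference, a minimal sketch of the base (two-parameter) LogSim similarity, following equations (2) and (3), could look as follows; the link counting is simplified to case-insensitive exclusive word matches, and k = 2.7 follows the text.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Minimal sketch of the base LogSim similarity, following equations (2) and (3). */
public class LogSim {

    private static final double K = 2.7; // penalty constant used in Cordeiro et al.'s experiments

    public double similarity(String s1, String s2) {
        List<String> words1 = new ArrayList<>(Arrays.asList(s1.toLowerCase().split("\\s+")));
        List<String> words2 = new ArrayList<>(Arrays.asList(s2.toLowerCase().split("\\s+")));
        int x = Math.max(words1.size(), words2.size());

        // Count exclusive lexical links: each word in one sentence may link to at most one word in the other.
        int lambda = 0;
        for (String word : words1) {
            if (words2.remove(word)) {
                lambda++;
            }
        }
        if (lambda == 0) {
            return 0.0;
        }
        double l = -Math.log((double) lambda / x) / Math.log(2); // eq. (2): -log2(lambda / x)
        return l < 1.0 ? l : Math.exp(-K * l);                    // eq. (3)
    }
}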

(3) This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of another.


2.2.3 Canonicalization as complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution, converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves particularly number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other more 'sophisticated' methodologies, which again should support the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question. Two sentences sharing most of their words are likely to be paraphrases. But these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only the 'easy' paraphrases were identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S1, S2) = cos(S1, S2) − mismatch_penalty   (4)

with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming formula – a way of breaking down a complex problem into a series of simpler subproblems:

s(S1, S2) = max of:

    s(S1, S2−1) − skip_penalty
    s(S1−1, S2) − skip_penalty
    s(S1−1, S2−1) + sim(S1, S2)
    s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1)
    s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2)
    s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2)   (5)

with the skip penalty ensuring that only sentence pairs with a sufficiently high similarity get aligned.
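
To make the recurrence more concrete, here is a minimal sketch of how it could be computed, assuming the pairwise sentence similarities from equation (4) are precomputed into a matrix; all names are ours, not Barzilay and Elhadad's:

    // Sketch of the alignment recurrence in equation (5). sim[i][j] is assumed
    // to hold sim(S1_i, S2_j) from equation (4), 0-indexed over the sentences
    // of the two texts; skipPenalty is the skip penalty.
    public final class LocalAlignment {

        public static double align(double[][] sim, double skipPenalty) {
            int n = sim.length;
            int m = n > 0 ? sim[0].length : 0;
            double[][] s = new double[n + 1][m + 1];

            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    double best = Math.max(s[i][j - 1] - skipPenalty,
                                           s[i - 1][j] - skipPenalty);
                    best = Math.max(best, s[i - 1][j - 1] + sim[i - 1][j - 1]);
                    if (j >= 2) {   // align S1_i with both S2_j and S2_{j-1}
                        best = Math.max(best, s[i - 1][j - 2]
                                + sim[i - 1][j - 1] + sim[i - 1][j - 2]);
                    }
                    if (i >= 2) {   // align S2_j with both S1_i and S1_{i-1}
                        best = Math.max(best, s[i - 2][j - 1]
                                + sim[i - 1][j - 1] + sim[i - 2][j - 1]);
                    }
                    if (i >= 2 && j >= 2) {   // cross alignment of two pairs
                        best = Math.max(best, s[i - 2][j - 2]
                                + sim[i - 1][j - 2] + sim[i - 2][j - 1]);
                    }
                    s[i][j] = best;
                }
            }
            return s[n][m];
        }
    }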

Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier in order to determine whether we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a matching between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using inverse document frequency (IDF) [37], given


by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = 1/2 · ( Σw∈T1 (maxSim(w, T2) · idf(w)) / Σw∈T1 idf(w)  +  Σw∈T2 (maxSim(w, T1) · idf(w)) / Σw∈T2 idf(w) )   (6)

For each word w in the segment T1, the goal is to identify the word that matches it the most in T2, given by maxSim(w, T2), then apply the specificity via IDF, sum these weighted values, and normalize the result by the total IDF of the text segment in question. The same process is applied to T2 and, in the end, we take the average between the T1-to-T2 and T2-to-T1 directions (a small code sketch of this computation is given after the list of metrics below). They offer us eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22], and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics.

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

4 http://www.natcorp.ox.ac.uk

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.

– Lin – takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.
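
As announced above, here is a small sketch of how formula (6) can be computed, assuming a word-to-word similarity function maxSim(w, segment) (backed by any of the metrics just listed) and an idf lookup are available; the names are illustrative and not taken from the original paper:

    import java.util.List;
    import java.util.Map;
    import java.util.function.BiFunction;

    // Sketch of the text-to-text similarity of formula (6): a directed,
    // idf-weighted average of the best word-to-word similarities, computed in
    // both directions and averaged. maxSim and idf are assumed to be supplied
    // (e.g. a WordNet-based metric and idf counts from the British National Corpus).
    public final class SegmentSimilarity {

        public static double similarity(List<String> t1, List<String> t2,
                                        BiFunction<String, List<String>, Double> maxSim,
                                        Map<String, Double> idf) {
            return 0.5 * (directed(t1, t2, maxSim, idf) + directed(t2, t1, maxSim, idf));
        }

        private static double directed(List<String> from, List<String> to,
                                       BiFunction<String, List<String>, Double> maxSim,
                                       Map<String, Double> idf) {
            double weighted = 0.0;
            double idfSum = 0.0;
            for (String w : from) {
                double weight = idf.getOrDefault(w, 1.0);   // unseen words default to 1.0
                weighted += maxSim.apply(w, to) * weight;
                idfSum += weight;
            }
            return idfSum == 0.0 ? 0.0 : weighted / idfSum;
        }
    }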

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model in order to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.

– Cosine measure – by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning, and therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors a and b, and the similarity is calculated using the following formula:

sim(a, b) = (a W b) / (|a| |b|)   (7)

with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea, Qiu or Zhang, and five also with a better precision rating than those three approaches.

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)   (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques we also saw different types of corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

5 http://www.inesc-id.pt

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get us closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of that article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalization, clustering and filtering of the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as

Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)   (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned before, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any two of their members. Initially, every candidate answer has a cluster of its own and then, at each step, the two closest clusters are merged, if their distance is under a certain threshold. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer in it. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
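
A compact sketch of this single-link clustering step, using the overlap distance of equation (9); class and method names are illustrative and do not mirror the actual JustAsk code:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch: single-link clustering of candidate answers. Every answer starts
    // in its own cluster; at each step the two closest clusters are merged, as
    // long as their distance stays under the threshold.
    public final class SingleLinkClustering {

        // Overlap distance of equation (9), over whitespace-separated tokens.
        static double overlap(String a, String b) {
            Set<String> x = new HashSet<>(Arrays.asList(a.toLowerCase().trim().split("\\s+")));
            Set<String> y = new HashSet<>(Arrays.asList(b.toLowerCase().trim().split("\\s+")));
            Set<String> shared = new HashSet<>(x);
            shared.retainAll(y);
            return 1.0 - (double) shared.size() / Math.min(x.size(), y.size());
        }

        static List<List<String>> cluster(List<String> answers, double threshold) {
            List<List<String>> clusters = new ArrayList<>();
            for (String answer : answers) {
                List<String> singleton = new ArrayList<>();
                singleton.add(answer);
                clusters.add(singleton);
            }
            while (clusters.size() > 1) {
                int bestI = -1, bestJ = -1;
                double bestDistance = Double.MAX_VALUE;
                for (int i = 0; i < clusters.size(); i++) {
                    for (int j = i + 1; j < clusters.size(); j++) {
                        // single link: cluster distance = minimum pairwise distance
                        for (String a : clusters.get(i)) {
                            for (String b : clusters.get(j)) {
                                double d = overlap(a, b);
                                if (d < bestDistance) {
                                    bestDistance = d;
                                    bestI = i;
                                    bestJ = j;
                                }
                            }
                        }
                    }
                }
                if (bestDistance >= threshold) {
                    break;   // halt: remaining clusters are too far apart
                }
                clusters.get(bestI).addAll(clusters.remove(bestJ));
            }
            return clusters;
        }
    }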

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved   (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus   (11)

F-measure (F1):

    F = 2 · precision · recall / (precision + recall)   (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus   (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by

    MRR = (1/N) · Σi=1..N (1 / ranki)   (14)
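
As an illustration, a minimal sketch of the MRR computation, assuming we know, for each question, the rank of the first correct answer (0 meaning no correct answer was returned):

    // Sketch: Mean Reciprocal Rank (equation (14)). ranks[i] is the rank of the
    // first correct answer for question i (1 = top answer); 0 means no correct
    // answer was returned, contributing nothing to the sum.
    public final class Evaluation {

        static double meanReciprocalRank(int[] ranks) {
            if (ranks.length == 0) {
                return 0.0;
            }
            double sum = 0.0;
            for (int rank : ranks) {
                if (rank > 0) {
                    sum += 1.0 / rank;
                }
            }
            return sum / ranks.length;
        }
    }

For instance, ranks of 1, 3 and 0 over three questions yield (1 + 1/3 + 0) / 3 ≈ 0.44.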

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input, and then we see if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages that we found with each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1. Question=What is the capital of Somalia

URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing the corpus with a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each one of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word of the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we took before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line


3.2.3 TREC corpus

In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions, from which we extracted the questions we knew the New York Times had a correct answer to. 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore losing precious entity information, by using pronouns for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us question: "How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?"

The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
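
A minimal sketch of this filtering step; the pronoun list is the one given above, while everything else (class name, tokenization) is merely illustrative:

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    // Sketch: drop questions containing unresolved third-party references
    // (pronouns that point back to a previous question's topic).
    public final class PronounFilter {

        private static final List<String> PRONOUNS =
                Arrays.asList("he", "she", "it", "her", "him", "hers", "his");

        static boolean hasDanglingReference(String question) {
            for (String token : question.toLowerCase().split("\\W+")) {
                if (PRONOUNS.contains(token)) {
                    return true;
                }
            }
            return false;
        }

        static List<String> filter(List<String> questions) {
            return questions.stream()
                    .filter(q -> !hasDanglingReference(q))
                    .collect(Collectors.toList());
        }
    }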

But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129 – Q: Where was the festival held?
question 1130 – Q: What is the name of the artistic director of the festival?
question 1131 – Q: When was the festival held?
question 1132 – Q: Which actress appeared in two films shown at the festival?
question 1133 – Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134 – Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and therefore was composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution should be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and by noticing (and correcting) a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced it to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as the tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, and in as many variations, as in Google.

We also questioned whether the big problem is within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, over the 200 questions JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one correct final answer from the list of candidate answers. The tables also present the distribution of the number of questions per amount of extracted answers.

Extracted       Google – 8 passages             Google – 32 passages
Answers (#)     Questions (#)  CAE     AS       Questions (#)  CAE     AS
                               success success                 success success
0               21             0       –        19             0       –
1 to 10         73             62      53       16             11      11
11 to 20        32             29      22       13             9       7
21 to 40        11             10      8        61             57      39
41 to 60        1              1       1        30             25      17
61 to 100       0              –       –        18             14      11
> 100           0              –       –        5              4       3
All             138            102     84       162            120     88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.


Extracted       Google – 64 passages
Answers (#)     Questions (#)  CAE success  AS success
0               17             0            0
1 to 10         18             9            9
11 to 20        2              2            2
21 to 40        16             12           8
41 to 60        42             38           25
61 to 100       38             32           18
> 100           35             30           17
All             168            123          79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google

            Google – 8 passages    Google – 32 passages   Google – 64 passages
     Total  Q     CAE    AS        Q     CAE    AS         Q     CAE    AS
abb  4      3     1      1         3     1      1          4     1      1
des  9      6     4      4         6     4      4          6     4      4
ent  21     15    1      1         16    1      0          17    1      0
hum  64     46    42     34        55    48     36         56    48     31
loc  39     27    24     19        34    27     22         35    29     21
num  63     41    30     25        48    35     25         50    40     22
All  200    138   102    84        162   120    88         168   123    79

Table 4: Answer extraction intrinsic evaluation, for different amounts of retrieved passages, according to the question category

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.

The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of questions that were created using TREC questions and the New York Times corpus

Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages    20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus

Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there is only one instance of each of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls – Okavango River – Botswana" webpage...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, under a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Could there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer: LogSim gives us approximately 0.014, while Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case, we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to also cover acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we solved each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers

We receive several candidate answers of a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have made this process automatic and have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from there use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to – and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion
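
A rough sketch of how this module can be wired together, following the structure in Figure 5; all class names and fields are simplified stand-ins for the real JustAsk types, and the exact signatures may differ:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the relations module: candidate answers carry the relations
    // they share with other answers, and Scorer implementations boost their
    // scores accordingly.
    enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY, ALTERNATIVE, CONTRADICTION }

    class Relation {
        final RelationType type;
        final CandidateAnswer other;
        Relation(RelationType type, CandidateAnswer other) { this.type = type; this.other = other; }
    }

    class CandidateAnswer {
        final String text;
        final List<Relation> relations = new ArrayList<>();
        double score;
        CandidateAnswer(String text) { this.text = text; }
    }

    interface Scorer {
        void score(List<CandidateAnswer> candidates);
    }

    // Baseline Scorer: counts exact repetitions (frequency), each worth 1.0.
    class BaselineScorer implements Scorer {
        public void score(List<CandidateAnswer> candidates) {
            for (CandidateAnswer a : candidates)
                for (CandidateAnswer b : candidates)
                    if (a != b && a.text.equalsIgnoreCase(b.text))
                        a.score += 1.0;
        }
    }

    // Rules Scorer: equivalence adds 1.0, inclusion (either direction) adds 0.5;
    // alternative and contradictory relations are left untouched in this first experiment.
    class RulesScorer implements Scorer {
        public void score(List<CandidateAnswer> candidates) {
            for (CandidateAnswer a : candidates)
                for (Relation r : a.relations)
                    if (r.type == RelationType.EQUIVALENCE)
                        a.score += 1.0;
                    else if (r.type == RelationType.INCLUDES || r.type == RelationType.INCLUDED_BY)
                        a.score += 0.5;
        }
    }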


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, metres to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
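
A minimal sketch of what such a conversion normalizer can look like; the conversion factors are the standard ones, and the class and method names (as well as the very simplified parsing) are ours:

    // Sketch: normalize distance, size, weight and temperature answers to a
    // single unit so that equivalent values end up in the same cluster.
    public final class UnitNormalizer {

        static double kilometresToMiles(double km)  { return km * 0.621371; }
        static double metresToFeet(double m)        { return m * 3.28084; }
        static double kilogramsToPounds(double kg)  { return kg * 2.20462; }
        static double celsiusToFahrenheit(double c) { return c * 9.0 / 5.0 + 32.0; }

        // e.g. "300 km" -> "186.4 miles" (illustration only)
        static String normalizeDistance(String answer) {
            String[] parts = answer.trim().split("\\s+");
            if (parts.length == 2 && (parts[1].equalsIgnoreCase("km")
                    || parts[1].equalsIgnoreCase("kilometres")
                    || parts[1].equalsIgnoreCase("kilometers"))) {
                double miles = kilometresToMiles(Double.parseDouble(parts[0]));
                return String.format("%.1f miles", miles);
            }
            return answer;   // anything else is left untouched
        }
    }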

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of the new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals only to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, allowing room for further extensions.

4.2.2 Proximity Measure

As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
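
A sketch of this proximity boost, using the 20-character threshold and the 5-point boost mentioned above; the character-offset bookkeeping is deliberately simplified (only the first occurrence of the answer and of each keyword is considered):

    import java.util.List;

    // Sketch: boost a candidate answer when, on average, it occurs close to the
    // query keywords inside the passage it was extracted from.
    public final class ProximityBooster {

        static final int CLOSENESS_THRESHOLD = 20;   // characters
        static final double BOOST = 5.0;             // points added to the score

        static double proximityBoost(String passage, String answer, List<String> keywords) {
            int answerPos = passage.indexOf(answer);
            if (answerPos < 0 || keywords.isEmpty()) {
                return 0.0;
            }
            double totalDistance = 0.0;
            int found = 0;
            for (String keyword : keywords) {
                int keywordPos = passage.indexOf(keyword);
                if (keywordPos >= 0) {
                    totalDistance += Math.abs(answerPos - keywordPos);
                    found++;
                }
            }
            if (found == 0) {
                return 0.0;
            }
            return (totalDistance / found) < CLOSENESS_THRESHOLD ? BOOST : 0.0;
        }
    }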


4.3 Results

                   200 questions             1210 questions
Before features    P 50.6  R 87.0  A 44.0    P 28.0  R 79.5  A 22.2
After features     P 51.1  R 87.0  A 44.5    P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described over Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references we had gathered in the previous corpus and, for each pair of references, assigned one of the possible four types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear and, for those cases, a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt about the relation type.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line

Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the size of the union of the two sets.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's Coefficient basically gives double the weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of each of the two strings [15]. The sets of bigrams of the two words are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would therefore be (2 · 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (a code sketch of this bigram computation is given after this list).


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 · ( m/|s1| + m/|s2| + (m − t)/m )   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by

dw = dj + l · p · (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 of Levenshtein):

D(i, j) = min { D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }   (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the threshold proposed (such as 'Massachussets' vs. 'Machassucets'), or even to match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequences in their entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
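
As referred to in the Dice description above, here is a small self-contained sketch of the bigram-based coefficient (in JustAsk these metrics come from the SimMetrics package, so this standalone version is only meant to make the computation explicit):

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: Dice's coefficient over sets of character bigrams.
    // For "Kennedy" vs "Keneddy" the shared bigrams are {Ke, en, ne, ed, dy},
    // giving 2*5 / (6 + 6) = 5/6.
    public final class DiceBigrams {

        static Set<String> bigrams(String s) {
            Set<String> result = new HashSet<>();
            for (int i = 0; i + 1 < s.length(); i++) {
                result.add(s.substring(i, i + 2));
            }
            return result;
        }

        static double dice(String a, String b) {
            Set<String> x = bigrams(a);
            Set<String> y = bigrams(b);
            Set<String> shared = new HashSet<>(x);
            shared.retainAll(y);
            return 2.0 * shared.size() / (x.size() + y.size());
        }
    }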


Then, the LogSim metric by Cordeiro and his team, described over the related work in Section 2.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much the threshold affects the performance of each metric.

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions, for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we should expect proportional results between the two corpora either way.

513 Results Analysis

As we can see the metric that achieves the best results is still Levenshteindistance with the LogSim distance at a close second place overall ndash even holdingthe distintion of being the best measure for the highest threshold evaluated (05)Defining the lsquobestrsquo threshold was not as clear as picking the best metric giventhese results The threshold of 017 provides the best results for more than halfof the metrics here but there are cases when it is only the second best even ifthe difference itself is virtually inexistent in most cases


Metric / Threshold    0.17                   0.33                   0.5
Levenshtein           P 51.1 R 87.0 A 44.5   P 43.7 R 87.5 A 38.5   P 37.3 R 84.5 A 31.5
Overlap               P 37.9 R 87.0 A 33.0   P 33.0 R 87.0 A 33.0   P 36.4 R 86.5 A 31.5
Jaccard index         P 9.8  R 56.0 A 5.5    P 9.9  R 55.5 A 5.5    P 9.9  R 55.5 A 5.5
Dice Coefficient      P 9.8  R 56.0 A 5.5    P 9.9  R 56.0 A 5.5    P 9.9  R 55.5 A 5.5
Jaro                  P 11.0 R 63.5 A 7.0    P 11.9 R 63.0 A 7.5    P 9.8  R 56.0 A 5.5
Jaro-Winkler          P 11.0 R 63.5 A 7.0    P 11.9 R 63.0 A 7.5    P 9.8  R 56.0 A 5.5
Needleman-Wunsch      P 43.7 R 87.0 A 38.0   P 43.7 R 87.0 A 38.0   P 16.9 R 59.0 A 10.0
LogSim                P 43.7 R 87.0 A 38.0   P 43.7 R 87.0 A 38.0   P 42.0 R 87.0 A 36.5
Soundex               P 35.9 R 83.5 A 30.0   P 35.9 R 83.5 A 30.0   P 28.7 R 82.0 A 23.5
Smith-Waterman        P 17.3 R 63.5 A 11.0   P 13.4 R 59.5 A 8.0    P 10.4 R 57.5 A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J' and between 'Fitzgerald' and 'F'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore     (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))     (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2     (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
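A minimal sketch of this measure follows (Python, written from equations (20)–(22); the way unmatched tokens are paired by first letter is an assumption, since the text does not fully specify it):

    def jfk_distance(a1, a2):
        t1, t2 = a1.split(), a2.split()
        common = {w.lower() for w in t1} & {w.lower() for w in t2}
        common_terms_score = len(common) / max(len(t1), len(t2))   # equation (21)

        rest1 = [w for w in t1 if w.lower() not in common]
        rest2 = [w for w in t2 if w.lower() not in common]
        non_common = len(rest1) + len(rest2)

        # pair unmatched tokens sharing a first letter, e.g. 'J' ~ 'John'
        first_letter_matches, used = 0, [False] * len(rest2)
        for w in rest1:
            for j, v in enumerate(rest2):
                if not used[j] and w[0].lower() == v[0].lower():
                    used[j] = True
                    first_letter_matches += 1
                    break

        first_letter_score = 0.0
        if non_common:
            first_letter_score = (first_letter_matches / non_common) / 2  # (22)
        return 1 - common_terms_score - first_letter_score               # (20)

    print(jfk_distance('J. F. Kennedy', 'John Fitzgerald Kennedy'))  # about 0.42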

A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                      200 questions            1210 questions
Before                P 51.1 R 87.0 A 44.5     P 28.2 R 79.8 A 22.5
After new metric      P 51.7 R 87.0 A 45.0     P 29.2 R 79.8 A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also considered analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

()
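A minimal sketch of what such a per-category dispatch could look like (the category names follow the ones used in this section; the metric functions are stand-ins, not JustAsk's actual API):

    from difflib import SequenceMatcher

    def _lexical_ratio(a, b):
        # placeholder for the real metrics (Levenshtein, LogSim, JFKDistance, ...)
        return SequenceMatcher(None, a, b).ratio()

    METRIC_BY_CATEGORY = {
        'Human': _lexical_ratio,     # e.g. Levenshtein / JFKDistance
        'Location': _lexical_ratio,  # e.g. LogSim
    }

    def category_similarity(category, answer1, answer2):
        metric = METRIC_BY_CATEGORY.get(category, _lexical_ratio)
        return metric(answer1, answer2)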

For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect much the analysis we are looking for), here are the results for every metric incorporated, for the three main categories Human, Location and Numeric, which can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric / Category     Human    Location    Numeric
Levenshtein           50.00    53.85       4.84
Overlap               42.42    41.03       4.84
Jaccard index         7.58     0.00        0.00
Dice                  7.58     0.00        0.00
Jaro                  9.09     0.00        0.00
Jaro-Winkler          9.09     0.00        0.00
Needleman-Wunsch      42.42    46.15       4.84
LogSim                42.42    46.15       4.84
Soundex               42.42    46.15       3.23
Smith-Waterman        9.09     2.56        0.00
JFKDistance           54.55    53.85       4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, ie results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since we find that the newly implemented distance equals or surpasses all the others (as in the case of the Human category).

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are error-prone, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve a) ignoring 'one' to 'three' when written in words, b) somehow excluding the publishing date from the snippet, and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved

– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision as high as 91, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2, and considering:

• Aij – the jth token of candidate answer Ai;

• Aijk – the kth char of the jth token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is the hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: 'M01 Y2001' includes 'D01 M01 Y2001'

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: 'Y2001' includes 'D01 M01 Y2001'



1.2 Questions of category Numeric

Given

• number = (d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX * max

• numXmin = numX * min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: '200 people' is equivalent to '200'

• matchesic(Ai, "()bnumber( )") and matchesic(Aj, "()bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
  – Example: '210 people' is equivalent to '200 people'

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )")
  – Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAi = numAj)
  – Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: 'approximately 200 people' includes '210'


• length(Aj) = 1 and matchesic(Aj, number) and matchesic(Ai, "()bmorethan number( )") and (numAi < numAj)
  – Example: 'more than 200' includes '300'

• length(Aj) = 1 and matchesic(Aj, number) and matchesic(Ai, "()blessthan number( )") and (numAi > numAj)
  – Example: 'less than 300' includes '200'

(An illustrative sketch of these numeric checks follows.)
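A rough illustrative sketch (Python) of how the 'approx' rules above could be checked; the regular expression and the 10% tolerance mirror the approx/max/min definitions, but the helper names are hypothetical:

    import re

    APPROX = r'(approx[a-z]*|around|roughly|about|some)'

    def approx_includes(ai, aj, tol=0.1):
        # does Ai (e.g. 'approximately 200 people') include the bare number Aj (e.g. '210')?
        if not re.fullmatch(r'\d+(\.\d+)?', aj.strip()):
            return False
        m = re.search(APPROX + r'\s+(\d+(\.\d+)?)', ai, re.IGNORECASE)
        if not m:
            return False
        num_ai, num_aj = float(m.group(2)), float(aj)
        return num_ai * (1 - tol) <= num_aj <= num_ai * (1 + tol)

    print(approx_includes('approximately 200 people', '210'))  # True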

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai0, Aj0) and equalsic(Ai2, Aj1)
  – Example: 'John F Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai1, Aj1) and matches(Ai0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: 'Mr Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai0, Aj0) and equalsic(Ai2, Aj2) and equalsic(Ai10, Aj10)
  – Example: 'John Fitzgerald Kennedy' is equivalent to 'John F Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai1, Aj0) and equalsic(Ai2, Aj1)
  – Example: 'John F Kennedy' is equivalent to 'F Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai1, Aj0)
  – Example: 'Mr Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2, Aj0) and matches(Ai0, "(republic|city|kingdom|island)") and matches(Ai1, "(of|in)")
  – Example: 'Republic of China' is equivalent to 'China'


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2, Aj0) and matches(Ai1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai2, Aj0) and matches(Ai, "sins+Aj0")
  – Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)s+(in|of)s+Aj0")
  – Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)s+Aj0")
  – Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai0, Aj2) and equalsic(Aj1, "")
  – Example: 'Florida' includes 'Miami Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai0, Aj3) and equalsic(Aj2, "")
  – Example: 'California' includes 'Los Angeles California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai0, Aj1) and matchesic(Aj0, "(south|west|north|east|central)[a-z]")
  – Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: 'animal' includes 'monkey'


Keywords

JustAsk, Question Answering, Relations between Answers, Paraphrase Identification, Clustering, Normalization, Corpus, Evaluation, Answer Extraction


Index

1 Introduction

2 Related Work
  2.1 Relating Answers
  2.2 Paraphrase Identification
    2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora
    2.2.2 Methodologies purely based on lexical similarity
    2.2.3 Canonicalization as complement to lexical features
    2.2.4 Use of context to extract similar phrases
    2.2.5 Use of dissimilarities and semantics to provide better similarity measures
  2.3 Overview

3 JustAsk architecture and work points
  3.1 JustAsk
    3.1.1 Question Interpretation
    3.1.2 Passage Retrieval
    3.1.3 Answer Extraction
  3.2 Experimental Setup
    3.2.1 Evaluation measures
    3.2.2 Multisix corpus
    3.2.3 TREC corpus
  3.3 Baseline
    3.3.1 Results with Multisix Corpus
    3.3.2 Results with TREC Corpus
  3.4 Error Analysis over the Answer Module

4 Relating Answers
  4.1 Improving the Answer Extraction module: Relations Module Implementation
  4.2 Additional Features
    4.2.1 Normalizers
    4.2.2 Proximity Measure
  4.3 Results
  4.4 Results Analysis
  4.5 Building an Evaluation Corpus

5 Clustering Answers
  5.1 Testing Different Similarity Metrics
    5.1.1 Clusterer Metrics
    5.1.2 Results
    5.1.3 Results Analysis
  5.2 New Distance Metric – 'JFKDistance'
    5.2.1 Description
    5.2.2 Results
    5.2.3 Results Analysis
  5.3 Possible new features – Combination of distance metrics to enhance overall results
  5.4 Posterior analysis

6 Conclusions and Future Work


List of Figures

1 JustAsk architecture, mirroring the architecture of a general QA system

2 How the software to create the 'answers corpus' works

3 Execution of the application to create answers corpora in the command line

4 Proposed approach to automate the detection of relations between answers

5 Relations module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion

6 Execution of the application to create relations corpora in the command line


List of Tables

1 JustAsk baseline best results for the corpus test of 200 questions

2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages

3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google

4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category

5 Baseline results for the three sets of corpus that were created using TREC questions and the New York Times corpus

6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus

7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC

8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded

9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC

10 Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result


Acronyms

TREC – Text REtrieval Conference
QA – Question-Answering
POS – Part of Speech


1 Introduction

QA systems have been with us since the second half of the 20th century. But it was with the World Wide Web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the shot – after all, most of the time we just want a specific, single 'correct' answer instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.

This led us to the development of JustAsk [39, 38], a QA system with the mission of retrieving (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], ie they consider the 'right' answer to be the one that appears more often in more documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.

In summary, the goals of this work are:

– to perform a state of the art review on answers' relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– to improve the Answer Extraction module by finding/implementing a proper methodology/tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– to test new clustering measures;

– to create corpora and software tools to evaluate our work.

This document is structured in the following way: Section 2 describes specific related work in the subject of relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the clustering stage and the results achieved; and finally, in Section 6, we present the main conclusions and future work.


2 Related Work

As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences if we extend the concept enough), we then discuss the different techniques to detect paraphrases described in the literature.

2.1 Relating Answers

As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, ie every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency count and on the number and type of relationships between pairs present in the relevant passages.

From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').

Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether from the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.

The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.

2.2 Paraphrase Identification

According to WordReference.com 1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as expressing the same idea in two different text fragments, in a symmetric relationship in which one entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.

Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on referred to as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages we present an overview of some of the approaches and techniques to paraphrase identification related to natural language, especially after the release of the MSR Corpus.

2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora

We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization by a team led by Vasileios Hatzivassiloglou, which allowed the clustering of highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:

1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.


– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches not only consider identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.

The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N ≤ 4) – in other words, how many different sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:

NGsim(S1, S2) = (1/N) * Σ_{n=1..N} [ Count_match(N-gram) / Count(N-gram) ]     (1)

in which Count_match(N-gram) counts how many N-grams overlap for a given sentence pair, and Count(N-gram) represents the maximum number of N-grams in the shorter sentence.
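A minimal sketch of this measure, assuming N-grams are collected as sets of word tuples (the formula does not state whether duplicates are counted):

    def ngram_overlap(s1, s2, n_max=4):
        # Word Simple N-gram Overlap (equation 1); the denominator is the
        # n-gram count of the shorter sentence for each n.
        t1, t2 = s1.lower().split(), s2.lower().split()
        total = 0.0
        for n in range(1, n_max + 1):
            g1 = {tuple(t1[i:i + n]) for i in range(len(t1) - n + 1)}
            g2 = {tuple(t2[i:i + n]) for i in range(len(t2) - n + 1)}
            denom = min(len(t1), len(t2)) - n + 1
            if denom > 0:
                total += len(g1 & g2) / denom
        return total / n_max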

A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edit operations needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses. A string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.


In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-Gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore some other options. As an alternative to these metrics, a new category called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), ie four common words between S1 and S2. The function in use is:

L(x, λ) = -log2(λ/x)     (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are way too similar). The complete formula is written below:

logSim(S1, S2) = { L(x, λ)            if L(x, λ) < 1.0
                 { e^(-k * L(x, λ))   otherwise     (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, ie the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.
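A minimal sketch of the basic (two-parameter) LogSim of equations (2) and (3), counting links as distinct shared words:

    import math

    def logsim(s1, s2, k=2.7):
        w1, w2 = s1.lower().split(), s2.lower().split()
        links = len(set(w1) & set(w2))   # lambda: exclusive lexical links
        x = max(len(w1), len(w2))        # words in the longest sentence
        if links == 0:
            return 0.0                   # L(x, lambda) tends to infinity
        l = -math.log2(links / x)        # equation (2)
        return l if l < 1.0 else math.exp(-k * l)   # equation (3)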

L(x, y, λ) is computed using weighting factors α and β (β being 1-α), applied to L(x, λ) and L(y, λ) respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of another.


2.2.3 Canonicalization as complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution, converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves particularly number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (ie producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which leaves out a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other more 'sophisticated' methodologies, which again should support the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question. Two sentences sharing most of their words are likely to be paraphrases. But these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S1, S2) = cos(S1, S2) - mismatch_penalty     (4)


with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming formula – a way of breaking down a complex problem into a series of simpler subproblems:

s(S1, S2) = max { s(S1, S2-1) - skip_penalty,
                  s(S1-1, S2) - skip_penalty,
                  s(S1-1, S2-1) + sim(S1, S2),
                  s(S1-1, S2-2) + sim(S1, S2) + sim(S1, S2-1),
                  s(S1-2, S2-1) + sim(S1, S2) + sim(S1-1, S2),
                  s(S1-2, S2-2) + sim(S1, S2-1) + sim(S1-1, S2) }     (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.

Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, ie the segments from which the answer has been extracted. This work has also been included because it warns us of the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phased framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, to determine whether we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus 4, a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = (1/2) * [ Σ_{w∈T1} (maxSim(w, T2) * idf(w)) / Σ_{w∈T1} idf(w)
                      + Σ_{w∈T2} (maxSim(w, T1) * idf(w)) / Σ_{w∈T2} idf(w) ]     (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxSim(w, T2), and then apply the specificity via IDF, sum the two, and normalize by the length of the text segment in question. The same process is applied to T2, and in the end we take the average between T1 to T2 and vice versa. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22] and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics (a small sketch of the segment-level formula follows the list).

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, ie the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

4 http://www.natcorp.ox.ac.uk

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have a high Information Content.

– Lin – takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.
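Going back to formula (6), here is a small sketch of the segment-level computation; maxSim and idf are placeholders for any of the eight word-to-word measures above and for an IDF table, so this is an illustration rather than Mihalcea's implementation:

    def segment_similarity(t1, t2, max_sim, idf):
        # t1, t2: lists of words; max_sim(w, segment) and idf(w) are supplied
        # by the caller (word-to-word measure and specificity weight).
        def directed(src, dst):
            num = sum(max_sim(w, dst) * idf(w) for w in src)
            den = sum(idf(w) for w in src)
            return num / den if den else 0.0
        return 0.5 * (directed(t1, t2) + directed(t2, t1))

    # toy usage with exact-match similarity and uniform IDF
    exact = lambda w, seg: 1.0 if w in seg else 0.0
    print(segment_similarity(['a', 'nice', 'dinner'], ['a', 'dinner', 'party'],
                             exact, lambda w: 1.0))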

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, ie methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories. Comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs 'assassination') or an adjective and a noun ('happy' vs 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.

– Cosine measure – by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning. They therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, a and b, and the similarity is calculated using the following formula:

sim(a, b) = (a W b) / (|a| |b|)     (7)


with W being a semantic similarity matrix containing the information about the similarity level of each word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea, Qiu or Zhang, and five also with a better precision rating than those three methods.
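A small sketch of formula (7), with word_sim standing in for any of the six WordNet metrics; this is an illustration under that assumption, not Fernando and Stevenson's code:

    import numpy as np

    def matrix_similarity(tokens1, tokens2, word_sim):
        # word_sim(w1, w2): any word-to-word similarity in [0, 1]
        vocab = sorted(set(tokens1) | set(tokens2))
        a = np.array([1.0 if w in tokens1 else 0.0 for w in vocab])
        b = np.array([1.0 if w in tokens2 else 0.0 for w in vocab])
        # W[i][j]: similarity between vocab[i] and vocab[j]; 1.0 on the diagonal
        W = np.array([[1.0 if u == v else word_sim(u, v) for v in vocab]
                      for u in vocab])
        return float(a @ W @ b) / (np.linalg.norm(a) * np.linalg.norm(b))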

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)   (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and therefore the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared with the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the one we found to have the most potential to be successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). To these, several interesting features were attached, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

5 http://www.inesc-id.pt

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask in this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state of the art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalization, clustering and filtering of the candidate answers, to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answer variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
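As an illustration of this normalization step, the following sketch shows how both date formats above could be reduced to the same canonical string; the class name and the regular expressions are assumptions for this example, not JustAsk's actual code.

import java.util.regex.*;

// Illustrative sketch of Numeric:Date normalization into the canonical
// 'D<day> M<month> Y<year>' form. Patterns and names are assumptions.
public class DateNormalizerSketch {
    private static final String[] MONTHS = {"january", "february", "march", "april",
            "may", "june", "july", "august", "september", "october", "november", "december"};

    public static String normalize(String answer) {
        String a = answer.toLowerCase().replace(",", "");
        // 'March 21, 1961' style
        Matcher m = Pattern.compile("([a-z]+) (\\d{1,2}) (\\d{4})").matcher(a);
        if (m.find()) return canonical(m.group(2), m.group(1), m.group(3));
        // '21 March 1961' style
        m = Pattern.compile("(\\d{1,2}) ([a-z]+) (\\d{4})").matcher(a);
        if (m.find()) return canonical(m.group(1), m.group(2), m.group(3));
        return answer; // leave untouched if no known pattern matches
    }

    private static String canonical(String day, String monthName, String year) {
        for (int i = 0; i < MONTHS.length; i++)
            if (MONTHS[i].startsWith(monthName))
                return "D" + Integer.parseInt(day) + " M" + (i + 1) + " Y" + year;
        return "D" + day + " Y" + year;
    }
}

Both normalize("March 21, 1961") and normalize("21 March 1961") yield 'D21 M3 Y1961' under this sketch.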

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)   (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As mentioned in Section 3.1.3, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially every candidate answer has a cluster of its own, and then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer in it. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
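The following Java sketch illustrates the combination just described – the overlap distance of equation (9) driving a single-link clustering loop with a threshold; names and the example threshold are illustrative rather than the system's real implementation.

import java.util.*;

// Sketch of overlap distance (equation 9) plus single-link agglomerative clustering.
public class SingleLinkClusteringSketch {

    static double overlapDistance(String x, String y) {
        Set<String> a = new HashSet<>(Arrays.asList(x.toLowerCase().split("\\s+")));
        Set<String> b = new HashSet<>(Arrays.asList(y.toLowerCase().split("\\s+")));
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return 1.0 - (double) common.size() / Math.min(a.size(), b.size());
    }

    // Single-link: distance between clusters = minimum distance between their members.
    static double clusterDistance(List<String> c1, List<String> c2) {
        double min = Double.MAX_VALUE;
        for (String x : c1)
            for (String y : c2)
                min = Math.min(min, overlapDistance(x, y));
        return min;
    }

    static List<List<String>> cluster(List<String> answers, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String a : answers) clusters.add(new ArrayList<>(Collections.singletonList(a)));

        while (clusters.size() > 1) {
            int bi = -1, bj = -1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = clusterDistance(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            if (best > threshold) break;                   // halt: closest pair is too far apart
            clusters.get(bi).addAll(clusters.remove(bj));  // merge the two closest clusters
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("Mogadishu", "Mogadishu Somalia", "Washington");
        System.out.println(cluster(candidates, 0.33));
        // -> [[Mogadishu, Mogadishu Somalia], [Washington]]
    }
}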

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluating QA systems [43], are used:

Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved   (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus   (11)

F-measure (F1):

F = 2 · (precision · recall) / (precision + recall)   (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus   (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) · Σ_{i=1..N} 1/rank_i   (14)
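As a small illustration of equation (14), the sketch below computes MRR from the rank of the first correct answer per question (0 meaning no correct answer was returned); the class name is illustrative.

// Sketch of the MRR computation of equation (14).
public class MrrSketch {
    static double meanReciprocalRank(int[] ranks) {
        double sum = 0.0;
        for (int rank : ranks)
            if (rank > 0) sum += 1.0 / rank;   // questions with no correct answer add 0
        return sum / ranks.length;
    }

    public static void main(String[] args) {
        // Example: first correct answer at ranks 1 and 3, and never found for the third query
        System.out.println(meanReciprocalRank(new int[]{1, 3, 0})); // ≈ 0.444
    }
}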

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then seeing if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also provides the correct answer for each question, searched in the Los Angeles Times editions of 1994. We call 'references' the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages that we found at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:

1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each one of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...', if we insert 'Mogadishu' the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer; but it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In this version, the script prints the status of the reference list to an output file as it is being filled, question by question, by the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.


3.2.3 TREC corpus

In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, which we filtered to extract the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly and what is the 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original referent after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us question: "How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?"

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
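A minimal sketch of this filtering step is shown below, assuming the questions are available as plain strings; the pronoun list mirrors the one above, while the class and method names are merely illustrative.

import java.util.*;

// Illustrative sketch: discard questions containing unresolved third-party pronouns.
public class PronounFilterSketch {
    private static final Set<String> PRONOUNS = new HashSet<>(Arrays.asList(
            "he", "she", "it", "her", "him", "hers", "his"));

    static boolean hasUnresolvedReference(String question) {
        for (String token : question.toLowerCase().replaceAll("[^a-z ]", " ").split("\\s+"))
            if (PRONOUNS.contains(token)) return true;
        return false;
    }

    static List<String> filter(List<String> questions) {
        List<String> kept = new ArrayList<>();
        for (String q : questions)
            if (!hasUnresolvedReference(q)) kept.add(q);
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(hasUnresolvedReference("When did she die?"));           // true
        System.out.println(hasUnresolvedReference("Where was the festival held?")); // false
    }
}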

But much less obvious cases were still present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, in several variations, as in Google.

We also questioned whether the big problem was within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions  Correct  Wrong  No Answer
200        88       87     25

Accuracy  Precision  Recall  MRR   Precision@1
44.0%     50.3%      87.5%   0.37  35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.

                        Google – 8 passages           Google – 32 passages
Extracted Answers (#)   Questions  CAE      AS        Questions  CAE      AS
                                   success  success              success  success
0                          21        0        –          19        0        –
1 to 10                    73       62       53          16       11       11
11 to 20                   32       29       22          13        9        7
21 to 40                   11       10        8          61       57       39
41 to 60                    1        1        1          30       25       17
61 to 100                   0        –        –          18       14       11
> 100                       0        –        –           5        4        3
All                       138      102       84         162      120       88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no matter how many passages are used.


                        Google – 64 passages
Extracted Answers (#)   Questions  CAE      AS
                                   success  success
0                          17        0        0
1 to 10                    18        9        9
11 to 20                    2        2        2
21 to 40                   16       12        8
41 to 60                   42       38       25
61 to 100                  38       32       18
> 100                      35       30       17
All                       168      123       79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.

             Google – 8 passages   Google – 32 passages   Google – 64 passages
      Total   Q    CAE   AS         Q    CAE   AS           Q    CAE   AS
abb     4     3     1     1         3     1     1           4     1     1
des     9     6     4     4         6     4     4           6     4     4
ent    21    15     1     1        16     1     0          17     1     0
hum    64    46    42    34        55    48    36          56    48    31
loc    39    27    24    19        34    27    22          35    29    21
num    63    41    30    25        48    35    25          50    40    22

      200   138   102    84       162   120    88         168   123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision       26.7%            28.2%           27.4%
Recall          77.2%            67.2%           76.9%
Accuracy        20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.

Later we made further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages     20      30      50      100
Accuracy   21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though the 100 passages offer the best results, they also take more than three times as long to run as just 20, with a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes over its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted, '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'


– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls – Okavango River – Botswana" webpage...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should be the threshold here in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviations normalization to include this kind of abbreviation.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers for a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics into more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have made this process automatic and extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the purpose of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers and, from then on, use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to – and a score for each relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes the other, and one for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not change at all for any of the corpora.
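The sketch below illustrates this scoring scheme (1.0 for frequency and equivalence, 0.5 for inclusions); it is a simplified reading of the module described above, and all names are illustrative rather than JustAsk's actual Scorer implementation.

import java.util.*;

// Simplified sketch of the relations-based scoring: each candidate answer accumulates
// score from its frequency and from the equivalence/inclusion relations it shares.
public class RelationScorerSketch {

    enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    static class Relation {
        final RelationType type;
        final String otherAnswer;
        Relation(RelationType type, String otherAnswer) {
            this.type = type;
            this.otherAnswer = otherAnswer;
        }
    }

    static double score(String answer, List<String> allCandidates, List<Relation> relations) {
        double score = 0.0;
        // Frequency: exact repetitions of the same answer string (weight 1.0 each)
        for (String c : allCandidates)
            if (c.equals(answer)) score += 1.0;
        // Relations: equivalence counts 1.0, inclusion (either direction) counts 0.5
        for (Relation r : relations)
            score += (r.type == RelationType.EQUIVALENCE) ? 1.0 : 0.5;
        return score;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("Ferrara", "Ferrara", "Ferrara Italy", "Italy");
        List<Relation> rels = Arrays.asList(
                new Relation(RelationType.INCLUDED_BY, "Ferrara Italy"));
        System.out.println(score("Ferrara", candidates, rels)); // 2.0 + 0.5 = 2.5
    }
}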


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
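A possible shape for the distance part of such a normalizer is sketched below; as a simplification, all distance units are mapped to a single canonical unit (kilometres) rather than between specific unit pairs. The conversion factors are standard, but the class and method names are assumptions for this example.

// Illustrative sketch of a distance-unit normalizer: '1000 miles' and '1609 km'
// become directly comparable once reduced to the same canonical unit.
public class DistanceNormalizerSketch {
    static double toKilometres(double value, String unit) {
        switch (unit.toLowerCase()) {
            case "km":
            case "kilometres":
            case "kilometers": return value;
            case "mile":
            case "miles":      return value * 1.609344;
            case "foot":
            case "feet":       return value * 0.0003048;
            case "meter":
            case "meters":
            case "m":          return value * 0.001;
            default: throw new IllegalArgumentException("Unknown unit: " + unit);
        }
    }

    public static void main(String[] args) {
        System.out.println(toKilometres(1000, "miles")); // ≈ 1609.3
        System.out.println(toKilometres(300, "km"));     // 300.0
    }
}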

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals only to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, allowing room for further extensions.

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
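The following sketch illustrates one possible reading of this proximity boost, using character offsets inside the passage; the 20-character threshold and the 5-point boost follow the text, while everything else (names, the exact way distances are averaged) is an assumption.

// Sketch of the proximity boost: if the average character distance between a candidate
// answer and the query keywords inside the passage is below the threshold, boost its score.
public class ProximityBoostSketch {
    static final int DISTANCE_THRESHOLD = 20;
    static final double BOOST = 5.0;

    static double boostedScore(double baseScore, String passage, String answer, String[] keywords) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0) return baseScore;

        int total = 0, found = 0;
        for (String keyword : keywords) {
            int kwPos = passage.indexOf(keyword);
            if (kwPos >= 0) {
                total += Math.abs(kwPos - answerPos);
                found++;
            }
        }
        if (found == 0) return baseScore;

        double averageDistance = (double) total / found;
        return averageDistance < DISTANCE_THRESHOLD ? baseScore + BOOST : baseScore;
    }

    public static void main(String[] args) {
        String passage = "The Okavango River rises in the Angolan highlands and flows over 1000 miles";
        System.out.println(boostedScore(1.0, passage, "1000 miles",
                new String[]{"Okavango", "River"})); // no boost: keywords are far away
    }
}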


4.3 Results

                   200 questions                 1210 questions
Before features    P 50.6%  R 87.0%  A 44.0%     P 28.0%  R 79.5%  A 22.2%
After features     P 51.1%  R 87.0%  A 44.5%     P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described over Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered for the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;


– 'D' for cases of doubt about the relation type.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.

Over the next chapter we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of characters the two sequences have in common by the total number of distinct characters occurring in either of them.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams for each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 · 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (see the code sketch after this list).


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 · ( m/|s1| + m/|s2| + (m − t)/m )   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 used by Levenshtein):

D(i, j) = min{ D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }   (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
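As promised in the Dice's Coefficient item above, here is a small sketch of the bigram-based computation used in the 'Kennedy' vs. 'Keneddy' example; it is illustrative code, not part of the simmetrics package.

import java.util.*;

// Sketch of the bigram-based Dice coefficient: Dice = 2|A ∩ B| / (|A| + |B|)
// over the two sets of character bigrams.
public class DiceBigramSketch {
    static Set<String> bigrams(String s) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i < s.length() - 1; i++)
            grams.add(s.substring(i, i + 2));
        return grams;
    }

    static double dice(String s1, String s2) {
        Set<String> a = bigrams(s1.toLowerCase());
        Set<String> b = bigrams(s2.toLowerCase());
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return 2.0 * common.size() / (a.size() + b.size());
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy")); // 2*5 / (6+6) ≈ 0.83
    }
}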


Then the LogSim metric by Cordeiro and his team, described over the Related Work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much it affects the performance of each metric.

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions, for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually non-existent in most cases.


Metric \ Threshold        0.17                     0.33                     0.5
Levenshtein        P 51.1  R 87.0  A 44.5   P 43.7  R 87.5  A 38.5   P 37.3  R 84.5  A 31.5
Overlap            P 37.9  R 87.0  A 33.0   P 33.0  R 87.0  A 33.0   P 36.4  R 86.5  A 31.5
Jaccard index      P  9.8  R 56.0  A  5.5   P  9.9  R 55.5  A  5.5   P  9.9  R 55.5  A  5.5
Dice Coefficient   P  9.8  R 56.0  A  5.5   P  9.9  R 56.0  A  5.5   P  9.9  R 55.5  A  5.5
Jaro               P 11.0  R 63.5  A  7.0   P 11.9  R 63.0  A  7.5   P  9.8  R 56.0  A  5.5
Jaro-Winkler       P 11.0  R 63.5  A  7.0   P 11.9  R 63.0  A  7.5   P  9.8  R 56.0  A  5.5
Needleman-Wunsch   P 43.7  R 87.0  A 38.0   P 43.7  R 87.0  A 38.0   P 16.9  R 59.0  A 10.0
LogSim             P 43.7  R 87.0  A 38.0   P 43.7  R 87.0  A 38.0   P 42.0  R 87.0  A 36.5
Soundex            P 35.9  R 83.5  A 30.0   P 35.9  R 83.5  A 30.0   P 28.7  R 82.0  A 23.5
Smith-Waterman     P 17.3  R 63.5  A 11.0   P 13.4  R 59.5  A  8.0   P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (in %) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded in the original.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim, also using a tokenizer to break the strings into sets of tokens, and then taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, while also adding an extra weight for strings that not only share at least one perfect match but also have another token that shares the first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens left unmatched on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common with the new metric, we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
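Below is one possible reading of equations (20)-(22) as Java code; it treats length() as the number of tokens of each answer and counts nonCommonTerms over both answers, which are interpretation choices, and all names are illustrative rather than the actual JustAsk implementation.

import java.util.*;

// Sketch of the proposed 'JFKDistance': perfect token matches count as common terms,
// and leftover tokens that share a first letter with a leftover token of the other
// answer add half the weight of a full match.
public class JFKDistanceSketch {

    static double distance(String answer1, String answer2) {
        List<String> t1 = new ArrayList<>(Arrays.asList(answer1.toLowerCase().split("\\s+")));
        List<String> t2 = new ArrayList<>(Arrays.asList(answer2.toLowerCase().split("\\s+")));
        int maxLength = Math.max(t1.size(), t2.size());

        // Common terms: perfect token matches, removed from both token lists
        int commonTerms = 0;
        for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
            if (t2.remove(it.next())) { commonTerms++; it.remove(); }
        }

        // Tokens that remain unmatched in either answer ('J.', 'John', 'F.', 'Fitzgerald', ...)
        int nonCommonTerms = t1.size() + t2.size();

        // First-letter matches between the remaining tokens, counted once per pair
        int firstLetterMatches = 0;
        for (String token : t1)
            for (Iterator<String> it = t2.iterator(); it.hasNext(); )
                if (it.next().charAt(0) == token.charAt(0)) {
                    firstLetterMatches++;
                    it.remove();
                    break;
                }

        double commonTermsScore = (double) commonTerms / maxLength;
        double firstLetterMatchesScore =
                nonCommonTerms == 0 ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    public static void main(String[] args) {
        // 'Kennedy' is a perfect match; 'J.'/'John' and 'F.'/'Fitzgerald' share first letters
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy")); // ≈ 0.42
        System.out.println(distance("Mount Kilimanjaro", "Mt Kilimanjaro"));      // 0.25
    }
}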

A few flaws of this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows us the results after the implementation of this new distance metric.

                   200 questions                 1210 questions
Before             P 51.1%  R 87.0%  A 44.5%     P 28.2%  R 79.8%  A 22.5%
After new metric   P 51.7%  R 87.0%  A 45.0%     P 29.2%  R 79.8%  A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis we are looking for too much), Table 10 shows the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).

Metric \ Category    Human    Location   Numeric
Levenshtein          50.00     53.85       4.84
Overlap              42.42     41.03       4.84
Jaccard index         7.58      0.00       0.00
Dice                  7.58      0.00       0.00
Jaro                  9.09      0.00       0.00
Jaro-Winkler          9.09      0.00       0.00
Needleman-Wunsch     42.42     46.15       4.84
LogSim               42.42     46.15       4.84
Soundex              42.42     46.15       3.23
Smith-Waterman        9.09      2.56       0.00
JFKDistance          54.55     53.85       4.84

Table 10: Accuracy results (in %) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy bolded in the original, except where all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, i.e., results are always 'proportional' in each category to how they perform overall. That stops us from picking the best metric for each category, since we find out that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are particularly error-prone, due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to a) ignore 'one' to 'three' when written in words, b) somehow exclude the publishing date from the snippet, and c) improve the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers as separate single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error is in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
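
As an illustration of points a) and c) of the numeric bullet above, the following minimal sketch (in Python; the function names and the exact regular expression are our own assumptions, not JustAsk code) discards small spelled-out numbers and keeps only candidates that look like proper numeric strings:

import re

# Hypothetical filter for Numeric candidates: drop 'one'..'three' written in
# words and accept only strings that look like plain numbers.
SMALL_NUMBER_WORDS = {"one", "two", "three"}
NUMERIC_PATTERN = re.compile(r"^\d+(,\d{3})*(\.\d+)?$")   # e.g. '1,609' or '48.0'

def keep_numeric_candidate(candidate):
    token = candidate.strip().lower()
    if token in SMALL_NUMBER_WORDS:             # a) ignore 'one' to 'three' in words
        return False
    return bool(NUMERIC_PATTERN.match(token))   # c) stricter regular expression

if __name__ == "__main__":
    for c in ["one", "1,609", "2010", "48.0", "three"]:
        print(c, keep_numeric_candidate(c))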


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved

– We performed a state of the art survey on relations between answers and paraphrase identification, to detect new possible ways of improving JustAsk's results within the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

ndash We developed new metrics to improve the clustering of answers

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

ndash We performed further analysis to detect persistent error patterns

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric in their own right.

– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help to improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Languages, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.


[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.


[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.


[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the jth token of candidate answer Ai;

• Ai,j,k – the kth char of the jth token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is the hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: 'M01 Y2001' includes 'D01 M01 Y2001'

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: 'Y2001' includes 'D01 M01 Y2001'

A sketch of these two checks in code is given below.
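
The following minimal Python sketch mirrors the two Date inclusion rules above, assuming answers are already normalized into token lists such as ['D01', 'M01', 'Y2001']; the function names are illustrative only:

def contains(x, y):
    # True if every token of y appears in x (x 'contains' y).
    return all(tok in x for tok in y)

def date_includes(ai, aj):
    # ai includes aj when aj is the more specific date (one or two extra tokens).
    return contains(aj, ai) and (len(aj) - len(ai)) in (1, 2)

if __name__ == "__main__":
    print(date_includes(["M01", "Y2001"], ["D01", "M01", "Y2001"]))  # True
    print(date_includes(["Y2001"], ["D01", "M01", "Y2001"]))         # True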



1.2 Questions of category Numeric

Given:

• number = (\d+([0-9,.E])*)

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX * max

• numXmin = numX * min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: '200 people' is equivalent to '200'

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
  – Example: '210 people' is equivalent to '200 people'

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
  – Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  – Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: 'approximately 200 people' includes '210'


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  – Example: 'more than 200' includes '300'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  – Example: 'less than 300' includes '200'

A code sketch of part of these Numeric rules follows.
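
Below is a rough Python sketch of part of the Numeric rules above: two bare numbers are deemed equivalent when one falls within the +/-10% band of the other, and a 'more than'/'less than' answer includes a bare number on the right side of the comparison. The helper names and simplified regular expressions are our own assumptions, not the actual JustAsk implementation:

import re

NUMBER = re.compile(r"\d+(\.\d+)?")
MAX_FACTOR, MIN_FACTOR = 1.1, 0.9

def first_number(text):
    match = NUMBER.search(text)
    return float(match.group()) if match else None

def numeric_equivalent(ai, aj):
    ni, nj = first_number(ai), first_number(aj)
    if ni is None or nj is None:
        return False
    return ni * MIN_FACTOR < nj < ni * MAX_FACTOR

def numeric_includes(ai, aj):
    ni, nj = first_number(ai), first_number(aj)
    if ni is None or nj is None:
        return False
    if re.search(r"more than|over|at least", ai) and ni < nj:
        return True
    if re.search(r"less than|almost|up to|nearly", ai) and ni > nj:
        return True
    return False

if __name__ == "__main__":
    print(numeric_equivalent("210 people", "200 people"))  # True
    print(numeric_includes("more than 200", "300"))         # True
    print(numeric_includes("less than 300", "200"))         # True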

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: 'Mr Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: 'John Fitzgerald Kennedy' is equivalent to 'John F Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F Kennedy' is equivalent to 'F Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: 'Mr Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)

• levenshteinDistance(Ai, Aj) < 1

A code sketch of some of these name-matching rules follows.
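
A small Python sketch of a few of the Human equivalence rules above, treating an answer as a list of name tokens (simplified with respect to the full rule set):

TITLES = {"Mr", "Mrs", "Dr", "Mister", "Miss", "Doctor", "Lady", "Madame"}

def eq_ic(a, b):
    return a.lower() == b.lower()

def human_equivalent(ai, aj):
    # 'John F Kennedy' vs 'John Kennedy': same first and last tokens.
    if len(ai) == 3 and len(aj) == 2:
        return eq_ic(ai[0], aj[0]) and eq_ic(ai[2], aj[1])
    # 'Mr Kennedy' vs 'John F Kennedy': title plus matching surname.
    if len(ai) == 2 and len(aj) == 3:
        return ai[0].rstrip(".") in TITLES and eq_ic(ai[1], aj[2])
    # 'John Fitzgerald Kennedy' vs 'John F Kennedy': middle tokens share the first char.
    if len(ai) == 3 and len(aj) == 3:
        return eq_ic(ai[0], aj[0]) and eq_ic(ai[2], aj[2]) and eq_ic(ai[1][0], aj[1][0])
    return False

if __name__ == "__main__":
    print(human_equivalent(["John", "F", "Kennedy"], ["John", "Kennedy"]))               # True
    print(human_equivalent(["Mr", "Kennedy"], ["John", "F", "Kennedy"]))                 # True
    print(human_equivalent(["John", "Fitzgerald", "Kennedy"], ["John", "F", "Kennedy"])) # True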

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: 'Republic of China' is equivalent to 'China'


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: 'Florida' includes 'Miami, Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: 'California' includes 'Los Angeles, California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  – Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: 'animal' includes 'monkey'

A sketch of the hypernymy check using WordNet follows.
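
The ishypernym() predicate used in the Entity rules can be approximated with WordNet hypernym closures; the Python sketch below (assuming nltk and its WordNet data are installed) is one possible stand-in, not the actual JustAsk recognizer:

from nltk.corpus import wordnet as wn

def is_hypernym(ai, aj):
    # True if ai names a hypernym of some noun sense of aj.
    for synset in wn.synsets(aj, pos=wn.NOUN):
        for ancestor in synset.closure(lambda s: s.hypernyms()):
            if ai in ancestor.lemma_names():
                return True
    return False

if __name__ == "__main__":
    print(is_hypernym("animal", "monkey"))  # True: 'animal' includes 'monkey'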


Index

1 Introduction 12

2 Related Work 14
   2.1 Relating Answers 14
   2.2 Paraphrase Identification 15
      2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora 15
      2.2.2 Methodologies purely based on lexical similarity 16
      2.2.3 Canonicalization as complement to lexical features 18
      2.2.4 Use of context to extract similar phrases 18
      2.2.5 Use of dissimilarities and semantics to provide better similarity measures 19
   2.3 Overview 22

3 JustAsk architecture and work points 24
   3.1 JustAsk 24
      3.1.1 Question Interpretation 24
      3.1.2 Passage Retrieval 25
      3.1.3 Answer Extraction 25
   3.2 Experimental Setup 26
      3.2.1 Evaluation measures 26
      3.2.2 Multisix corpus 27
      3.2.3 TREC corpus 30
   3.3 Baseline 31
      3.3.1 Results with Multisix Corpus 31
      3.3.2 Results with TREC Corpus 33
   3.4 Error Analysis over the Answer Module 34

4 Relating Answers 38
   4.1 Improving the Answer Extraction module: Relations Module Implementation 38
   4.2 Additional Features 41
      4.2.1 Normalizers 41
      4.2.2 Proximity Measure 41
   4.3 Results 42
   4.4 Results Analysis 42
   4.5 Building an Evaluation Corpus 42

5 Clustering Answers 44
   5.1 Testing Different Similarity Metrics 44
      5.1.1 Clusterer Metrics 44
      5.1.2 Results 46
      5.1.3 Results Analysis 47
   5.2 New Distance Metric – 'JFKDistance' 48
      5.2.1 Description 48
      5.2.2 Results 50
      5.2.3 Results Analysis 50
   5.3 Possible new features – Combination of distance metrics to enhance overall results 50
   5.4 Posterior analysis 51

6 Conclusions and Future Work 53


List of Figures

1 JustAsk architecture, mirroring the architecture of a general QA system 24
2 How the software to create the 'answers corpus' works 28
3 Execution of the application to create answers corpora in the command line 29
4 Proposed approach to automate the detection of relations between answers 38
5 Relations' module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion 40
6 Execution of the application to create relations' corpora in the command line 43


List of Tables

1 JustAsk baseline best results for the corpus test of 200 questions 32
2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages 32
3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google 33
4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category 33
5 Baseline results for the three sets of corpus that were created using TREC questions and the New York Times corpus 34
6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus 34
7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC 42
8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded 47
9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC 50
10 Accuracy results for every similarity metric added for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result 51


Acronyms

TREC – Text REtrieval Conference
QA – Question Answering
POS – Part of Speech


1 Introduction

QA systems have been with us since the second half of the 20th century. But it was with the explosion of the world wide web that QA systems truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth it: most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.

This led us to the development of JustAsk [39, 38], a QA system with the mission of retrieving (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e. they consider the 'right' answer to be the one that appears more often in more documents. Since we are working with the web as our corpus, we are dealing with the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.

In summary, the goals of this work are:

– to perform a state of the art survey on answers' relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– to improve the Answer Extraction module by finding and implementing a proper methodology and tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– to test new clustering measures;

– to create corpora and software tools to evaluate our work.

This document is structured in the following way: Section 2 describes related work on relations between answers and paraphrase identification; Section 3 details our point of departure, the JustAsk system, and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the clustering stage and the results achieved; finally, in Section 6 we present the main conclusions and future work.


2 Related Work

As previously stated, the focus of this work is to find a better way to catch relations of equivalence, from notational variations to paraphrases. First we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalence if we extend the concept enough), we then discuss the different techniques to detect paraphrases described in the literature.

2.1 Relating Answers

As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e. every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency counts and on the number and type of relationships between pairs present in the relevant passages.

From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – if answers are consistent and entail each other, whether notational variations ('John F Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').

Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.

The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering in JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.

2.2 Paraphrase Identification

According to WordReference.com1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which one entails the other; a potential addition of information, with only one fragment entailing the other, should be discarded as such.

Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus2, from now on mentioned as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages we present an overview of some approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.

2.2.1 SimFinder – a methodology based on linguistic analysis of comparable corpora

We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed the clustering of highly related textual units. In 1999 they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:

1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in paraphrase identification.


– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches not only consider identical words, but also words that match on their stem, the most basic form of a word, to which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type, for instance matching a noun phrase and a verb together.

The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of paraphrase identification, namely the introduction of a quantitative evaluation measure.

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N <= 4); in other words, how many different sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:

NGsim(S_1, S_2) = \frac{1}{N} \sum_{n=1}^{N} \frac{Count_{match}(n\text{-}gram)}{Count(n\text{-}gram)}    (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
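
A compact Python sketch of this measure, using whitespace tokenization and set-based counting (an approximation of the original definition, not the authors' implementation), could look like this:

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ng_sim(s1, s2, max_n=4):
    t1, t2 = s1.lower().split(), s2.lower().split()
    total = 0.0
    for n in range(1, max_n + 1):
        shorter = min(len(t1), len(t2)) - n + 1      # n-grams in the shorter sentence
        if shorter <= 0:
            continue
        total += len(ngrams(t1, n) & ngrams(t2, n)) / shorter
    return total / max_n

if __name__ == "__main__":
    print(ng_sim("the cat sat on the mat", "the cat lay on the mat"))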

A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of editions needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER), an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31], showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.


In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-Gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore some other options. As an alternative to these metrics, a new category called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e. four common words between S1 and S2. The function in use is:

L(x, \lambda) = -\log_2(\lambda / x)    (2)

where x is the number of words of the longest sentence and \lambda is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are way too similar). The complete formula is written below:

logSim(S1, S2) = \begin{cases} L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\ e^{-k \cdot L(x, \lambda)} & \text{otherwise} \end{cases}    (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as \lambda / x tends to 0 and therefore L(x, \lambda) tends to infinity. However, having a metric depending only on x, and not on y, i.e. the number of words in the shortest sentence, is not a perfect measure; after all, they are not completely independent, since \lambda <= y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: \lambda, x and y.

L(x, y, \lambda) is computed using weighting factors \alpha and \beta (\beta being 1 - \alpha), applied to L(x, \lambda) and L(y, \lambda), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better when the MSR Paraphrase Corpus was present in the test corpus.
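
For reference, a small Python sketch of the basic LogSim formula as transcribed above, counting shared word types as a rough stand-in for exclusive lexical links (an approximation, not the authors' code):

import math

def log_sim(s1, s2, k=2.7):
    t1, t2 = s1.lower().split(), s2.lower().split()
    links = len(set(t1) & set(t2))        # lambda: words shared by S1 and S2
    x = max(len(t1), len(t2))             # number of words of the longest sentence
    if links == 0:
        return 0.0
    l = -math.log2(links / x)             # L(x, lambda) = -log2(lambda / x)
    return l if l < 1.0 else math.exp(-k * l)

if __name__ == "__main__":
    s1 = "Mary had a very nice dinner for her birthday"
    s2 = "It was a successful dinner party for Mary on her birthday"
    print(round(log_sim(s1, s2), 3))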

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.


2.2.3 Canonicalization as complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution, converting the pairs of sentences into canonical forms at the lexical and syntactic level ('generic' versions of the original text) and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves particularly number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e. producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety, which leaves out a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions are already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question. Two sentences sharing most of their words are likely to be paraphrases. But these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only the 'easy' paraphrases were identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S1, S2) = cos(S1, S2) - mismatch\_penalty    (4)

with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming recurrence, a way of breaking down a complex problem into a series of simpler subproblems:

s(S1, S2) = \max \begin{cases} s(S1, S2-1) - skip\_penalty \\ s(S1-1, S2) - skip\_penalty \\ s(S1-1, S2-1) + sim(S1, S2) \\ s(S1-1, S2-2) + sim(S1, S2) + sim(S1, S2-1) \\ s(S1-2, S2-1) + sim(S1, S2) + sim(S1-1, S2) \\ s(S1-2, S2-2) + sim(S1, S2-1) + sim(S1-1, S2) \end{cases}    (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.

Results were interesting: a 'weak' similarity measure, combined with contextual information, managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we would study the answer's context, i.e. the segments from which the answer has been extracted. This work has also been included because it warns us of the several different types of paraphrases/equivalences that can exist in a corpus (more lexically based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, thus determining whether we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = \frac{1}{2} \left( \frac{\sum_{w \in T1} maxSim(w, T2) \cdot idf(w)}{\sum_{w \in T1} idf(w)} + \frac{\sum_{w \in T2} maxSim(w, T1) \cdot idf(w)}{\sum_{w \in T2} idf(w)} \right)    (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxSim(w, T2), then apply the specificity via IDF, sum these values and normalize by the length of the text segment in question. The same process is applied to T2, and in the end we take the average of the two directions. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis, LSA [17]) and six knowledge-based (Leacock and Chodorow, lch [18]; Lesk [1]; Wu and Palmer, wup [46]; Resnik [34]; Lin [22]; and Jiang and Conrath, jcn [14]). Below we offer a brief description of each of these metrics:
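
A minimal Python sketch of Equation (6), with the word-to-word similarity function and the idf table left as pluggable inputs (the trivial exact-match stand-in below is ours, not one of the eight measures listed):

def segment_similarity(t1, t2, word_sim, idf):
    def directed(src, dst):
        num = sum(max(word_sim(w, v) for v in dst) * idf.get(w, 1.0) for w in src)
        den = sum(idf.get(w, 1.0) for w in src)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))

if __name__ == "__main__":
    exact = lambda a, b: 1.0 if a == b else 0.0   # placeholder for maxSim
    idf = {"shot": 2.0, "murdered": 2.5}
    print(segment_similarity(["he", "was", "shot"], ["he", "was", "murdered"], exact, idf))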

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique in algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e. the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

4 http://www.natcorp.ox.ac.uk

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.

– Lin – takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e. methods to decide if one text can be entailed in another and vice versa; in other words, paraphrase identification. Moreover, they present a new model in order to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but extended to a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domains resource.

– Cosine measure – by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning, and therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, \vec{a} and \vec{b}, and the similarity is calculated using the following formula:

sim(\vec{a}, \vec{b}) = \frac{\vec{a} \, W \, \vec{b}}{|\vec{a}| \, |\vec{b}|}    (7)
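
Equation (7) is easy to express with a small numpy sketch (the 3x3 similarity matrix below is a toy example, purely illustrative):

import numpy as np

def matrix_similarity(a, b, w):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ w @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

if __name__ == "__main__":
    w = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.1],
                  [0.0, 0.1, 1.0]])   # pairwise word similarity values
    print(matrix_similarity([1, 0, 1], [0, 1, 1], w))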


with W being a semantic similarity matrix containing the information about the similarity level of a word pair; in other words, a number between 0 and 1 for each W_ij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's, and five with also a better precision rating than those three metrics.

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence, a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e. it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (1.4) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
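
The decision step of Equation (8) reduces to a simple threshold test; a sketch, with the similarity and dissimilarity scores left abstract:

def is_paraphrase(sim_score, diss_score, threshold):
    if diss_score == 0:
        return True                 # nothing unpaired: the sentences entail each other
    return (sim_score / diss_score) > threshold

if __name__ == "__main__":
    print(is_paraphrase(sim_score=12.0, diss_score=3.0, threshold=2.5))  # True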

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the one we found to have the most potential to be successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these came several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula for paraphrase identification.

For all of these techniques we also saw different types of corpora used for evaluation. The most prominent, and most widely available, is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e. pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

5 http://www.inesc-id.pt

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e. a word in the given question that represents the information we are looking for. Question classification is also a subtask in this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state of the art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e. extracts from a search corpus that contain key terms of the question, and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalization, clustering and filtering of the candidate answers, to give us a correct final answer.

JustAsk uses regular expressions (for Numeric-type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answer variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
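
An illustrative normalizer for such dates could look like the Python sketch below (a simplified assumption of how the canonical 'D21 M3 Y1961' shape is produced, not the actual JustAsk code):

import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june",
     "july", "august", "september", "october", "november", "december"])}

def normalize_date(text):
    day = month = year = None
    for tok in re.findall(r"[A-Za-z]+|\d+", text.lower()):
        if tok in MONTHS:
            month = MONTHS[tok]
        elif tok.isdigit() and len(tok) == 4:
            year = int(tok)
        elif tok.isdigit():
            day = int(tok)
    return "D{} M{} Y{}".format(day, month, year)

if __name__ == "__main__":
    print(normalize_date("March 21, 1961"))   # D21 M3 Y1961
    print(normalize_date("21 March 1961"))    # D21 M3 Y1961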

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as

    Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in the other. As mentioned earlier, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own, and at each step the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its size.
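
The single-link procedure just described can be sketched as follows; the class name is ours, and the distance argument stands for any of the metrics above (Levenshtein or overlap), normalized so that 0.0 means identical strings.

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Sketch of single-link clustering over candidate answers: start with one
// cluster per answer and keep merging the two closest clusters while their
// distance (minimum distance between any two members) is under the threshold.
public class SingleLinkClusterer {

    public static List<List<String>> cluster(List<String> answers,
                                             BiFunction<String, String, Double> distance,
                                             double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String answer : answers) {
            List<String> singleton = new ArrayList<>();
            singleton.add(answer);
            clusters.add(singleton);
        }
        while (true) {
            int bestI = -1, bestJ = -1;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = singleLink(clusters.get(i), clusters.get(j), distance);
                    if (d < bestDist) {
                        bestDist = d;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            if (bestI < 0 || bestDist > threshold) {
                break; // remaining clusters are all farther apart than the threshold
            }
            clusters.get(bestI).addAll(clusters.remove(bestJ));
        }
        return clusters;
    }

    // Single-link distance: minimum distance between any pair of members.
    private static double singleLink(List<String> c1, List<String> c2,
                                     BiFunction<String, String, Double> distance) {
        double min = Double.MAX_VALUE;
        for (String a : c1) {
            for (String b : c2) {
                min = Math.min(min, distance.apply(a, b));
            }
        }
        return min;
    }
}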

3.2 Experimental Setup

In this section we describe the corpora and the measures used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

    F = 2 × (precision × recall) / (precision + recall)    (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus    (13)

Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

    MRR = (1/N) × Σ_{i=1}^{N} 1/rank_i    (14)
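
As a small worked illustration of formula (14), the sketch below computes the MRR from the rank of the first correct answer for each question (0 meaning no correct answer was returned); the class is ours, not part of the evaluation scripts.

// Minimal MRR computation: ranks[i] is the rank of the first correct answer
// for question i, or 0 if no correct answer was returned.
public class MeanReciprocalRank {

    public static double mrr(int[] ranks) {
        double sum = 0.0;
        for (int rank : ranks) {
            if (rank > 0) {
                sum += 1.0 / rank;
            }
        }
        return sum / ranks.length;
    }

    public static void main(String[] args) {
        // First correct answer at ranks 1, 3 and 2; last question unanswered.
        System.out.println(mrr(new int[] { 1, 3, 2, 0 }));  // (1 + 1/3 + 1/2 + 0) / 4 ≈ 0.458
    }
}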

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the sets of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages found at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:

1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia
Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing at most 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title' and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.
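
A minimal sketch of the interactive loop behind this tool is shown below; file parsing and output writing are omitted and every name is hypothetical, but it mirrors the behaviour of Figures 2 and 3: the reference typed by the user is only accepted if it occurs in the snippet's content.

import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Sketch of the interactive reference-collection loop: for every snippet of a
// question, read a candidate reference from the command line and keep it only
// if it actually occurs in the snippet's content.
public class ReferenceCollector {

    public static List<String> collect(String question, List<String> snippets, Scanner in) {
        List<String> references = new ArrayList<>();
        System.out.println("Question: " + question);
        for (String content : snippets) {
            System.out.println("Content: " + content);
            System.out.print("Reference (empty to skip): ");
            String reference = in.nextLine().trim();
            if (!reference.isEmpty() && content.contains(reference)
                    && !references.contains(reference)) {
                references.add(reference);  // accepted as a true reference
            }
        }
        return references;
    }

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        List<String> snippets = List.of(
                "Mogadishu is the largest city in Somalia and the nation's capital.");
        System.out.println(collect("What is the capital of Somalia?", snippets, in));
    }
}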

3.2.3 TREC corpus

In an effort to use a larger and more 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions, from which we filtered the questions for which we knew the New York Times had a correct answer; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step) and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference. We simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?

In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.

New York Times corpus

Twenty years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, and in as many variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/sentence, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not the whole documents but only the paragraph that contains the keywords we send in the query. This change improved the results but, unfortunately, not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.

Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer was extracted, while the success of the AS stage is given by the number of questions that manage to extract one correct final answer from the list of candidate answers. The tables also present a distribution of the number of questions with a certain amount of extracted answers.

Extracted       Google – 8 passages              Google – 32 passages
Answers      Questions   CAE       AS          Questions   CAE       AS
                (#)      success   success        (#)      success   success
0               21        0         –              19        0         –
1 to 10         73        62        53             16        11        11
11 to 20        32        29        22             13        9         7
21 to 40        11        10        8              61        57        39
41 to 60        1         1         1              30        25        17
61 to 100       0         –         –              18        14        11
> 100           0         –         –              5         4         3
All             138       102       84             162       120       88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.


Extracted       Google – 64 passages
Answers      Questions   CAE       AS
                (#)      success   success
0               17        0         0
1 to 10         18        9         9
11 to 20        2         2         2
21 to 40        16        12        8
41 to 60        42        38        25
61 to 100       38        32        18
> 100           35        30        17
All             168       123       79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.

          Google – 8 passages     Google – 32 passages    Google – 64 passages
     Total    Q    CAE   AS          Q    CAE   AS           Q    CAE   AS
abb    4      3     1     1          3     1     1           4     1     1
des    9      6     4     4          6     4     4           6     4     4
ent   21     15     1     1         16     1     0          17     1     0
hum   64     46    42    34         55    48    36          56    48    31
loc   39     27    24    19         34    27    22          35    29    21
num   63     41    30    25         48    35    25          50    40    22
All  200    138   102    84        162   120    88         168   123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision      26.7%            28.2%           27.4%
Recall         77.2%            67.2%           76.9%
Accuracy       20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of questions that were created using TREC questions and the New York Times corpus.

Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages     20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though the 100 passages offer the best results, they also take more than three times as long as running just 20, with a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals why:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'

– 'Namibia has built a water canal, measuring about 300 km long ... Siyabona Africa Travel (Pty) Ltd "Popa Falls – Okavango River – Botswana" webpage ...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.

3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to also cover acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed these points.

4 Relating Answers

In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that compose our 'relations module', new normalizers and a new proximity measure.

4.1 Improving the Answer Extraction Module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations between answers described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

Once this process is automatic and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from there use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to, plus a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes the other, and one for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
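
A simplified sketch of this scoring layout follows; the actual class structure in JustAsk may differ, so the names below are illustrative and the weights simply restate the configuration described above (1.0 for frequency and equivalence, 0.5 for inclusion).

import java.util.List;

// Relation types considered when boosting candidate answers.
enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

// A relation an answer shares with another answer.
class Relation {
    final RelationType type;
    final String otherAnswer;
    Relation(RelationType type, String otherAnswer) {
        this.type = type;
        this.otherAnswer = otherAnswer;
    }
}

// Sketch of the Scorer interface: each implementation adds its contribution
// to a candidate answer's score.
interface Scorer {
    double score(String answer, List<String> allAnswers, List<Relation> relations);
}

// Baseline scorer: counts how many times the exact same answer appears.
class BaselineScorer implements Scorer {
    public double score(String answer, List<String> allAnswers, List<Relation> relations) {
        return allAnswers.stream().filter(answer::equals).count();
    }
}

// Rules scorer: 1.0 per relation of equivalence, 0.5 per relation of inclusion.
class RulesScorer implements Scorer {
    public double score(String answer, List<String> allAnswers, List<Relation> relations) {
        double total = 0.0;
        for (Relation r : relations) {
            total += (r.type == RelationType.EQUIVALENCE) ? 1.0 : 0.5;
        }
        return total;
    }
}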

Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.

4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
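
A minimal sketch of such a conversion normalizer is given below, using the conversions listed above; the class name, the regular expression and the output format are assumptions made for illustration.

import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of a unit-conversion normalizer: rewrites metric answers into a
// single canonical unit (km -> miles, m -> feet, kg -> pounds, C -> Fahrenheit).
public class UnitNormalizer {

    private static final Pattern NUMBER_UNIT =
            Pattern.compile("([\\d.]+)\\s*(km|kilometers?|kg|kilograms?|m|meters?|°?C)");

    public static String normalize(String answer) {
        Matcher m = NUMBER_UNIT.matcher(answer);
        if (!m.find()) {
            return answer;
        }
        double value = Double.parseDouble(m.group(1));
        String unit = m.group(2).toLowerCase();
        if (unit.startsWith("km") || unit.startsWith("kilometer")) {
            return String.format(Locale.US, "%.1f miles", value * 0.621371);
        } else if (unit.startsWith("kg") || unit.startsWith("kilogram")) {
            return String.format(Locale.US, "%.1f pounds", value * 2.20462);
        } else if (unit.startsWith("m")) {
            return String.format(Locale.US, "%.1f feet", value * 3.28084);
        } else { // Celsius
            return String.format(Locale.US, "%.1f Fahrenheit", value * 9.0 / 5.0 + 32.0);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("300 km"));  // 186.4 miles
    }
}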

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it only covers very particular cases for the moment, leaving room for further extensions.

4.2.2 Proximity Measure

For the problem in point 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing throughout the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
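
The proximity heuristic can be sketched as follows, with the 20-character threshold and the 5-point boost mentioned above; the names and the character-offset approximation of distance are ours.

import java.util.List;

// Sketch of the proximity boost: if the average distance (in characters)
// between a candidate answer and the query keywords found in the passage is
// under 20, the answer's score is increased by 5 points.
public class ProximityBooster {

    private static final int DISTANCE_THRESHOLD = 20;
    private static final double BOOST = 5.0;

    public static double boost(String passage, String answer,
                               List<String> keywords, double currentScore) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0 || keywords.isEmpty()) {
            return currentScore;
        }
        double totalDistance = 0.0;
        int found = 0;
        for (String keyword : keywords) {
            int keywordPos = passage.indexOf(keyword);
            if (keywordPos >= 0) {
                totalDistance += Math.abs(answerPos - keywordPos);
                found++;
            }
        }
        if (found > 0 && totalDistance / found < DISTANCE_THRESHOLD) {
            return currentScore + BOOST;
        }
        return currentScore;
    }
}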

4.3 Results

                    200 questions                  1210 questions
Before features     P 50.6%   R 87.0%   A 44.0%    P 28.0%   R 79.5%   A 22.2%
After features      P 51.1%   R 87.0%   A 44.5%    P 28.2%   R 79.8%   A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the possible four types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in that case a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.

In the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.

5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):

    J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of characters in common by the sum of the total number of characters of the two sequences.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

    Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double the weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams for each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard (intersection over union), we would have 5/7 ≈ 0.71. (A small code sketch of this bigram computation is given right after this list of metrics.)

– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

    Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

    dw = dj + l × p × (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 used by Levenshtein):

    D(i, j) = min { D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }    (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.

Then, the LogSim metric by Cordeiro and his team, described in the Related Work section, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much the threshold affects the performance of each metric.
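
To make the bigram-based computation used in the Dice example above concrete, here is a small self-contained sketch (not the SimMetrics implementation itself) that reproduces the 'Kennedy' vs. 'Keneddy' values:

import java.util.HashSet;
import java.util.Set;

// Sketch of the bigram-based Dice and Jaccard computations used in the
// 'Kennedy' vs. 'Keneddy' example above.
public class BigramSimilarity {

    static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    static double dice(String a, String b) {
        Set<String> ba = bigrams(a), bb = bigrams(b);
        Set<String> common = new HashSet<>(ba);
        common.retainAll(bb);
        return 2.0 * common.size() / (ba.size() + bb.size());
    }

    static double jaccard(String a, String b) {
        Set<String> ba = bigrams(a), bb = bigrams(b);
        Set<String> common = new HashSet<>(ba);
        common.retainAll(bb);
        Set<String> union = new HashSet<>(ba);
        union.addAll(bb);
        return (double) common.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));     // 10/12 ≈ 0.83
        System.out.println(jaccard("Kennedy", "Keneddy"));  // 5/7 ≈ 0.71
    }
}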

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.

Metric \ Threshold        0.17                    0.33                    0.5
Levenshtein        P 51.1  R 87.0  A 44.5  P 43.7  R 87.5  A 38.5  P 37.3  R 84.5  A 31.5
Overlap            P 37.9  R 87.0  A 33.0  P 33.0  R 87.0  A 33.0  P 36.4  R 86.5  A 31.5
Jaccard index      P  9.8  R 56.0  A  5.5  P  9.9  R 55.5  A  5.5  P  9.9  R 55.5  A  5.5
Dice Coefficient   P  9.8  R 56.0  A  5.5  P  9.9  R 56.0  A  5.5  P  9.9  R 55.5  A  5.5
Jaro               P 11.0  R 63.5  A  7.0  P 11.9  R 63.0  A  7.5  P  9.8  R 56.0  A  5.5
Jaro-Winkler       P 11.0  R 63.5  A  7.0  P 11.9  R 63.0  A  7.5  P  9.8  R 56.0  A  5.5
Needleman-Wunch    P 43.7  R 87.0  A 38.0  P 43.7  R 87.0  A 38.0  P 16.9  R 59.0  A 10.0
LogSim             P 43.7  R 87.0  A 38.0  P 43.7  R 87.0  A 38.0  P 42.0  R 87.0  A 36.5
Soundex            P 35.9  R 83.5  A 30.0  P 35.9  R 83.5  A 30.0  P 28.7  R 82.0  A 23.5
Smith-Waterman     P 17.3  R 63.5  A 11.0  P 13.4  R 59.5  A  8.0  P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (in %) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are obtained with Levenshtein at the 0.17 threshold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it additionally gives an extra weight to strings that not only share at least a perfect match but also have another token sharing the first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

    distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

    commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

    firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.

A few flaws of this measure were identified later on, namely regarding its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures according to the type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
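
A sketch of one possible implementation of this distance follows, reading formulas (20)–(22) with length() interpreted as the number of tokens of an answer; tokenization and normalization details in JustAsk may differ.

import java.util.Arrays;
import java.util.List;

// Sketch of the proposed 'JFKDistance': rewards common tokens and, with half
// the weight, unmatched tokens whose first letter matches a token of the
// other answer (e.g. 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy').
public class JFKDistance {

    public static double distance(String a1, String a2) {
        List<String> t1 = Arrays.asList(a1.replace(".", "").toLowerCase().split("\\s+"));
        List<String> t2 = Arrays.asList(a2.replace(".", "").toLowerCase().split("\\s+"));

        int commonTerms = 0;
        int firstLetterMatches = 0;
        for (String token : t1) {
            if (t2.contains(token)) {
                commonTerms++;
            } else {
                for (String other : t2) {
                    if (!t1.contains(other) && other.charAt(0) == token.charAt(0)) {
                        firstLetterMatches++;
                        break;
                    }
                }
            }
        }
        int nonCommonTerms = (t1.size() - commonTerms) + (t2.size() - commonTerms);
        double commonTermsScore = (double) commonTerms / Math.max(t1.size(), t2.size());
        double firstLetterMatchesScore =
                nonCommonTerms == 0 ? 0.0 : (firstLetterMatches / (double) nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    public static void main(String[] args) {
        // Roughly 0.42, low enough to cluster the two answers together.
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}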

5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                     200 questions                  1210 questions
Before               P 51.1%   R 87.0%   A 44.5%    P 28.2%   R 79.8%   A 22.5%
After new metric     P 51.7%   R 87.0%   A 45.0%    P 29.2%   R 79.8%   A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible New Features – Combination of Distance Metrics to Enhance Overall Results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or a group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:

if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)

For that effort, we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for too much), Table 10 shows the results for every metric incorporated, over the three main categories that can be easily evaluated – Human, Location and Numeric (answers to categories such as Description, obtained by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).

Metric \ Category     Human    Location   Numeric
Levenshtein           50.00    53.85      48.4
Overlap               42.42    41.03      48.4
Jaccard index          7.58     0.00       0.0
Dice                   7.58     0.00       0.0
Jaro                   9.09     0.00       0.0
Jaro-Winkler           9.09     0.00       0.0
Needleman-Wunch       42.42    46.15      48.4
LogSim                42.42    46.15      48.4
Soundex               42.42    46.15      32.3
Smith-Waterman         9.09     2.56       0.0
JFKDistance           54.55    53.85      48.4

Table 10: Accuracy results (in %) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. The best accuracy in each category is obtained by JFKDistance (tied with other metrics in the Location and Numeric columns).

We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since we find out that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior Analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written in words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.

– Filtering stage does not seem to work for certain questions – for some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we list the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error detected is in the reference file that lists all the possible 'correct' answers, which may be missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.

6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.

References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006

[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989

53

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez, and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur, and A. Mendes. Just ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is the hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1) – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2) – Example: Y2001 includes D01 M01 Y2001



1.2 Questions of category Numeric

Given

• number = (d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)") – Example: 200 people is equivalent to 200

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj) – Example: 210 people is equivalent to 200 people

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj) – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi) – Example: approximately 200 people includes 210


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj) – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj) – Example: less than 300 includes 200

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1) – Example: John F Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)") – Example: Mr Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0) – Example: John Fitzgerald Kennedy is equivalent to John F Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1) – Example: John F Kennedy is equivalent to F Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0) – Example: Mr Kennedy is equivalent to Kennedy

• containsic(Ai, Aj) – Example:

• levenshteinDistance(Ai, Aj) < 1 – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)") – Example: Republic of China is equivalent to China

1 Rules for detecting relations between answers 4

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)") – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0) and matches(Ai, "in\s+Aj,0") – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0") – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0") – Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",") – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",") – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*") – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj)) – Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj) – Example: animal includes monkey
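To make the flavour of these rules concrete, the following is a minimal Python sketch of how a few of them could be implemented: the Numeric equivalence/inclusion rules and the Human:Individual prefix rule. The 1.1/0.9 factors and the approx/number patterns follow the definitions above; the tokenisation and regular expressions are illustrative assumptions, not JustAsk's actual code.

```python
import re

MAX_F, MIN_F = 1.1, 0.9   # the 'max' and 'min' factors defined above
APPROX = re.compile(r'\b(approx[a-z]*|around|roughly|about|some)\b', re.I)
NUMBER = re.compile(r'\d+(?:[.,]\d+)?')

def numeric_equivalent(ai, aj):
    """Numeric equivalence: Aj's number falls inside Ai's tolerance window."""
    ni, nj = NUMBER.search(ai), NUMBER.search(aj)
    if not (ni and nj):
        return False
    vi = float(ni.group().replace(',', ''))
    vj = float(nj.group().replace(',', ''))
    return vi * MIN_F < vj < vi * MAX_F          # e.g. '210 people' ~ '200 people'

def numeric_includes(ai, aj):
    """Numeric inclusion: an 'approx'-style answer includes a bare number close to its value."""
    if not NUMBER.fullmatch(aj):                 # Aj must be a single number
        return False
    ni = NUMBER.search(ai)
    if not (ni and APPROX.search(ai)):           # Ai must contain a number and an approx marker
        return False
    vi = float(ni.group().replace(',', ''))
    vj = float(aj.replace(',', ''))
    return vj * MIN_F < vi < vj * MAX_F          # e.g. 'approximately 200 people' includes '210'

def human_individual_equivalent(ai, aj):
    """'Mr Kennedy' ~ 'Kennedy': two tokens vs one, last tokens equal (case-insensitive)."""
    ti, tj = ai.split(), aj.split()
    return len(ti) == 2 and len(tj) == 1 and ti[1].lower() == tj[0].lower()

if __name__ == '__main__':
    print(numeric_equivalent('210 people', '200 people'))        # True
    print(numeric_includes('approximately 200 people', '210'))   # True
    print(human_individual_equivalent('Mr Kennedy', 'Kennedy'))  # True
```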


5.2.1 Description 48
5.2.2 Results 50
5.2.3 Results Analysis 50

5.3 Possible new features – Combination of distance metrics to enhance overall results 50

5.4 Posterior analysis 51

6 Conclusions and Future Work 53


List of Figures

1 JustAsk architecture, mirroring the architecture of a general QA system 24

2 How the software to create the 'answers corpus' works 28

3 Execution of the application to create answers corpora in the command line 29

4 Proposed approach to automate the detection of relations between answers 38

5 Relations' module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion 40

6 Execution of the application to create relations' corpora in the command line 43


List of Tables

1 JustAsk baseline best results for the corpus test of 200 questions 32

2 Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages 32

3 Answer extraction intrinsic evaluation, using 64 passages from the search engine Google 33

4 Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category 33

5 Baseline results for the three sets of corpus that were created using TREC questions and the New York Times corpus 34

6 Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus 34

7 Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC 42

8 Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded 47

9 Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC 50

10 Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result 51


Acronyms

TREC – Text REtrieval Conference
QA – Question-Answering
POS – Part of Speech


1 Introduction

QA Systems have been with us since the second half of the 20th century. But it was with the world wide web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the effort – after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should, after all, be the main purpose of a QA system.

This led us to the development of JustAsk [39, 38], a QA system with the mission to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears more often, in more documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.

In summary, the goals of this work are:

– to review the state of the art on answers' relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– to improve the Answer Extraction module, by finding/implementing a proper methodology/tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– to test new clustering measures;

– to create corpora and software tools to evaluate our work.

This document is structured in the following way: Section 2 describes specific related work on the subject of relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the Clustering stage and the results achieved; and finally, in Section 6, we present the main conclusions and future work.


2 Related Work

As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variations to paraphrases. First we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then discuss the different techniques to detect paraphrases described in the literature.

2.1 Relating Answers

As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency count and on the number and type of relationships between pairs present in the relevant passages.

From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – if answers are consistent and mutually entail each other, whether notational variations ('John F Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').

Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.

The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.

2.2 Paraphrase Identification

According to WordReference.com 1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each fragment entails the other – meaning that a potential addition of information, where only one fragment entails the other, should be discarded as such.

Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on referred to as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages we will present an overview of some of the approaches and techniques for paraphrase identification related to natural language, especially after the release of the MSR Corpus.

2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora

We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, from a team led by Vasileios Hatzivassiloglou, that allowed the clustering of highly related textual units. In 1999 they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:

1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.


– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches not only consider identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering: analyzing if two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.

The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N ≤ 4) – in other words, how many shared sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:

NGsim(S_1, S_2) = \frac{1}{N} \sum_{n=1}^{N} \frac{Count_{match}(n\text{-}gram)}{Count(n\text{-}gram)}    (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
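A small sketch of the Word Simple N-gram Overlap of equation (1), assuming whitespace tokenisation and set-based match counting; it is an illustration of the measure, not Barzilay and Lee's implementation.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(s1, s2, max_n=4):
    """Word simple N-gram overlap, as in equation (1)."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    total = 0.0
    for n in range(1, max_n + 1):
        shorter = t1 if len(t1) <= len(t2) else t2
        count = len(ngrams(shorter, n))              # n-grams in the shorter sentence
        if count == 0:
            continue
        matches = len(set(ngrams(t1, n)) & set(ngrams(t2, n)))
        total += matches / count
    return total / max_n

print(ngram_overlap('the cat sat on the mat', 'the cat lay on the mat'))
```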

A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edit operations required to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.


In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-Gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

L(x, \lambda) = -\log_2(\lambda / x)    (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are way too similar). The complete formula is written below:

logSim(S_1, S_2) = \begin{cases} L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\ e^{-k \cdot L(x, \lambda)} & \text{otherwise} \end{cases}    (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.

L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases), to compensate for the lack of negative pairs offered by the two original corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
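A sketch of the basic LogSim score of equations (2) and (3). The exclusive lexical links are approximated here by plain case-insensitive word overlap, which is cruder than the link counting described above (it will also count function words such as 'a' and 'for'); k = 2.7 follows the value reported above.

```python
import math

def logsim(s1, s2, k=2.7):
    """Basic LogSim of equations (2)-(3): penalised log of the link ratio."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    links = len(set(w1) & set(w2))      # crude approximation of exclusive lexical links
    x = max(len(w1), len(w2))           # length of the longest sentence
    if links == 0:
        return 0.0
    l = -math.log2(links / x)           # L(x, lambda)
    return l if l < 1.0 else math.exp(-k * l)

print(logsim('Mary had a very nice dinner for her birthday',
             'It was a successful dinner party for Mary on her birthday'))
```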


2.2.3 Canonicalization as a complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves particularly number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which leaves out a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs only as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving Numeric Entities that can be considered answers for certain types of questions are already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question. Two sentences sharing most of their words are likely to be paraphrases. But these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information yet share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only the 'easy' paraphrases were identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S_1, S_2) = \cos(S_1, S_2) - mismatch\_penalty    (4)


with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming formula – a way of breaking down a complex problem into a series of simpler subproblems:

s(S_1, S_2) = \max \begin{cases}
s(S_1, S_2 - 1) - skip\_penalty \\
s(S_1 - 1, S_2) - skip\_penalty \\
s(S_1 - 1, S_2 - 1) + sim(S_1, S_2) \\
s(S_1 - 1, S_2 - 2) + sim(S_1, S_2) + sim(S_1, S_2 - 1) \\
s(S_1 - 2, S_2 - 1) + sim(S_1, S_2) + sim(S_1 - 1, S_2) \\
s(S_1 - 2, S_2 - 2) + sim(S_1, S_2 - 1) + sim(S_1 - 1, S_2)
\end{cases}    (5)

with the skip penalty ensuring that only sentence pairs with a high similarity measure get aligned.
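A compact sketch of the dynamic-programming alignment of equation (5). sim() is left abstract (any sentence-level similarity, e.g. the penalised cosine of equation (4)), and the penalty value in the toy usage is an arbitrary illustration, not the one used by Barzilay and Elhadad.

```python
def align(sents1, sents2, sim, skip_penalty=0.001):
    """Optimal alignment weight s(i, j), following the recurrence in equation (5)."""
    n, m = len(sents1), len(sents2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                s[i][j - 1] - skip_penalty,                            # skip a sentence of text 2
                s[i - 1][j] - skip_penalty,                            # skip a sentence of text 1
                s[i - 1][j - 1] + sim(sents1[i - 1], sents2[j - 1]),   # one-to-one match
            ]
            if j >= 2:   # one sentence of text 1 against two of text 2
                candidates.append(s[i - 1][j - 2]
                                  + sim(sents1[i - 1], sents2[j - 1])
                                  + sim(sents1[i - 1], sents2[j - 2]))
            if i >= 2:   # two sentences of text 1 against one of text 2
                candidates.append(s[i - 2][j - 1]
                                  + sim(sents1[i - 1], sents2[j - 1])
                                  + sim(sents1[i - 2], sents2[j - 1]))
            if i >= 2 and j >= 2:  # two-to-two (crossed) match
                candidates.append(s[i - 2][j - 2]
                                  + sim(sents1[i - 1], sents2[j - 2])
                                  + sim(sents1[i - 2], sents2[j - 1]))
            s[i][j] = max(candidates)
    return s[n][m]

toy_sim = lambda a, b: 1.0 if a == b else 0.0
# 2.0: 'b' gets folded into a two-to-one match instead of paying a skip penalty
print(align(['a', 'b', 'c'], ['a', 'c'], toy_sim))
```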

Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phased framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier, in order to determine if we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus 4, a collection of 100 million words that gathers both spoken-language and written samples.

4 http://www.natcorp.ox.ac.uk

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)} \right)    (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxSim(w, T2), and then apply the specificity via IDF, sum these weighted matches, and normalize the sum with the IDF weights of the segment in question. The same process is applied to T2, and in the end the two directions, from T1 to T2 and vice versa, are averaged. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22], and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics:

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses an algebraic technique called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.

– Lin – takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.
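A sketch of the segment-to-segment similarity of equation (6). The word-to-word similarity and the IDF function are passed in as parameters, since any of the eight metrics above (and any document collection for the IDF counts) could be plugged in; the toy usage below uses exact word match and uniform IDF purely for illustration.

```python
def text_similarity(t1, t2, word_sim, idf):
    """Mihalcea-style segment similarity, equation (6)."""
    def directed(src, dst):
        num = sum(max(word_sim(w, w2) for w2 in dst) * idf(w) for w in src)
        den = sum(idf(w) for w in src)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))

# toy usage: exact-match word similarity and uniform IDF
words1 = 'the president was assassinated'.split()
words2 = 'the president was murdered'.split()
print(text_similarity(words1, words2, lambda a, b: float(a == b), lambda w: 1.0))
```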

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and to the verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.

– Cosine measure – by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning, and therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of a similarity matrix, first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, \vec{a} and \vec{b}, and the similarity is calculated using the following formula:

sim(\vec{a}, \vec{b}) = \frac{\vec{a} \, W \, \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}    (7)


with W being a semantic similarity matrix containing information about the similarity level of each word pair – in other words, a number between 0 and 1 for each W_{ij}, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's, and five with also a better precision rating than those three approaches.
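A sketch of the vector-matrix-vector similarity of equation (7). The similarity matrix W is built here from a hand-made toy word similarity; in the actual approach it would be filled with one of the WordNet-based metrics listed above.

```python
import numpy as np

def matrix_similarity(s1_tokens, s2_tokens, word_sim):
    """Fernando & Stevenson style similarity, equation (7)."""
    vocab = sorted(set(s1_tokens) | set(s2_tokens))
    a = np.array([1.0 if w in s1_tokens else 0.0 for w in vocab])
    b = np.array([1.0 if w in s2_tokens else 0.0 for w in vocab])
    W = np.array([[word_sim(w1, w2) for w2 in vocab] for w1 in vocab])
    return float(a @ W @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy word similarity: 1 for identical words, 0.8 for a hand-picked synonym pair
def toy_sim(w1, w2):
    if w1 == w2:
        return 1.0
    return 0.8 if {w1, w2} == {'shot', 'murdered'} else 0.0

print(matrix_similarity('he was shot'.split(), 'he was murdered'.split(), toy_sim))
```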

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score, as the ratio of the similarity and dissimilarity scores:

Paraphrase(S_1, S_2) = \frac{Sim(S_1, S_2)}{Diss(S_1, S_2)}    (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the one we found to have the most potential to be successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score into the paraphrase identification formula.

For all of these techniques, we also saw different types of corpora for evaluation. The most prominent, and most widely available, is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two plus extra negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID 5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.

5 http://www.inesc-id.pt

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask in this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions, like 'What is a parrot?', use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of that article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the variation between answers, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
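A minimal sketch of the kind of date normalisation described above; the two patterns handled and the 'D M Y' output format mimic the example given, and are not the actual JustAsk normaliser.

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ['january', 'february', 'march', 'april', 'may', 'june',
     'july', 'august', 'september', 'october', 'november', 'december'])}

def normalize_date(answer):
    """Map 'March 21, 1961' or '21 March 1961' to the canonical 'D21 M3 Y1961'."""
    text = answer.lower().replace(',', '')
    m = re.match(r'(\w+) (\d{1,2}) (\d{4})$', text)       # 'march 21 1961'
    if m and m.group(1) in MONTHS:
        return f'D{int(m.group(2))} M{MONTHS[m.group(1)]} Y{m.group(3)}'
    m = re.match(r'(\d{1,2}) (\w+) (\d{4})$', text)       # '21 march 1961'
    if m and m.group(2) in MONTHS:
        return f'D{int(m.group(1))} M{MONTHS[m.group(2)]} Y{m.group(3)}'
    return answer                                         # leave unrecognised answers untouched

print(normalize_date('March 21, 1961'))   # D21 M3 Y1961
print(normalize_date('21 March 1961'))    # D21 M3 Y1961
```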

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 - \frac{|X \cap Y|}{\min(|X|, |Y|)}    (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As mentioned earlier, the Levenshtein distance computes the minimum number of edit operations required to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own, and then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
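A sketch of this clustering stage: the overlap distance of equation (9) plus a single-link clustering loop with a distance threshold. The Levenshtein distance could be plugged in the same way; the scoring and the choice of representative follow the description above, but this is an illustration, not JustAsk's implementation.

```python
def overlap_distance(x, y):
    """Equation (9): 1 - |X ∩ Y| / min(|X|, |Y|), over token sets."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(tx & ty) / min(len(tx), len(ty))

def single_link_cluster(answers, distance, threshold=0.33):
    """Merge the two closest clusters while their single-link distance is under the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

answers = ['John F Kennedy', 'John Kennedy', 'Kennedy', 'Lyndon Johnson']
for c in single_link_cluster(answers, overlap_distance):
    rep = max(c, key=len)                            # representative: the cluster's longest answer
    print(rep, '<-', c, 'score:', float(len(c)))     # score: sum of 1.0 per candidate answer
```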

3.2 Experimental Setup

In this section we describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = \frac{\text{Correctly judged items}}{\text{Relevant (non-null) items retrieved}}    (10)

Recall:

Recall = \frac{\text{Relevant (non-null) items retrieved}}{\text{Items in test corpus}}    (11)

F-measure (F1):

F = \frac{2 \times precision \times recall}{precision + recall}    (12)

Accuracy:

Accuracy = \frac{\text{Correctly judged items}}{\text{Items in test corpus}}    (13)

Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}    (14)
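The measures above, written out as a small sketch; the judgement of whether an individual answer is correct is abstracted away into a list of 0/1 values, and the example counts are only illustrative.

```python
def evaluate(judgements, total_questions):
    """judgements: list of 1 (correct) / 0 (wrong) for every question the system answered."""
    answered = len(judgements)
    correct = sum(judgements)
    precision = correct / answered if answered else 0.0
    recall = answered / total_questions
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / total_questions
    return precision, recall, f1, accuracy

def mrr(ranks_of_first_correct):
    """Mean reciprocal rank; use None when no correct answer was returned for a query."""
    n = len(ranks_of_first_correct)
    return sum(1.0 / r for r in ranks_of_first_correct if r) / n

# toy usage: 88 correct out of 175 attempted, on a 200-question corpus
print(evaluate([1] * 88 + [0] * 87, 200))
print(mrr([1, 2, None, 3]))
```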

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input, and then we see if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions, covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call the set of correct answers for each of the 200 questions the references. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages found over each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1. Question=What is the capital of Somalia?

URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing the corpus, with a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...', and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we followed before writing the application. In its current version, the script prints the status of the reference list to an output file as it is being filled, question by question, after the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.
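A rough sketch of how such an annotation helper can work. The file formats, JSON layout and prompts here are invented for illustration – the real tool's formats are not documented in this excerpt – but the core idea is the same: accept an input only if it occurs in the snippet's content, and dump the state of the references list after each question.

```python
import json
import sys

def annotate(questions_file, corpus_file, output_file):
    """Walk through the snippets of each question and record the references typed by the user."""
    questions = json.load(open(questions_file))   # hypothetical format: {question: [references]}
    snippets = json.load(open(corpus_file))       # hypothetical format: {question: [contents]}
    for question, references in questions.items():
        for content in snippets.get(question, []):
            print('\nQ:', question, '\nContent:', content)
            answer = input('reference (empty to skip): ').strip()
            if answer and answer in content:      # accept only strings present in the snippet
                references.append(answer)
        json.dump(questions, open(output_file, 'w'), indent=2)   # dump status after each question
    print('done, references written to', output_file)

if __name__ == '__main__':
    annotate(sys.argv[1], sys.argv[2], sys.argv[3])
```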


3.2.3 TREC corpus

In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and, in addition to the 200 test questions, we found a different set at TREC, which provided a set of 1768 questions, from which we filtered out the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information, by using pronouns for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us question: "How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?"

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still many less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after noticing a few mistakes in the application that filters the questions that had a correct answer from the New York Times, we reached a total of 1,309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1,210 questions.

New York Times corpus

The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, and in as many variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
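The paragraph-as-passage indexing can be sketched as follows. We assume a recent Lucene release (the exact version used is not stated here, and method signatures vary slightly between versions); the field names and the blank-line paragraph split are illustrative choices of ours, not the actual implementation.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Indexes each paragraph of an article as its own Lucene document, so that
// passage retrieval returns a paragraph instead of the whole article.
public class ParagraphIndexer {

    public static void indexArticle(IndexWriter writer, String articleId, String body)
            throws Exception {
        for (String paragraph : body.split("\\n\\s*\\n")) { // blank-line separated
            if (paragraph.trim().isEmpty()) continue;
            Document doc = new Document();
            doc.add(new StoredField("articleId", articleId));
            doc.add(new TextField("text", paragraph, Field.Store.YES));
            writer.addDocument(doc);
        }
    }

    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer =
                new IndexWriter(FSDirectory.open(Paths.get("nyt-index")), config)) {
            indexArticle(writer, "nyt-1987-0001",
                    "First paragraph of the article...\n\nSecond paragraph of the article...");
        }
    }
}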

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer is extracted, while the success of the AS stage is given by the number of questions for which a correct final answer is selected from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.

                      Google – 8 passages            Google – 32 passages
Extracted answers     Questions (#)  CAE   AS        Questions (#)  CAE   AS
0                     21             0     –         19             0     –
1 to 10               73             62    53        16             11    11
11 to 20              32             29    22        13             9     7
21 to 40              11             10    8         61             57    39
41 to 60              1              1     1         30             25    17
61 to 100             0              –     –         18             14    11
> 100                 0              –     –         5              4     3
All                   138            102   84        162            120   88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes; but if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.



                      Google – 64 passages
Extracted answers     Questions (#)   CAE   AS
0                     17              0     0
1 to 10               18              9     9
11 to 20              2               2     2
21 to 40              16              12    8
41 to 60              42              38    25
61 to 100             38              32    18
> 100                 35              30    17
All                   168             123   79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

              Google – 8 passages   Google – 32 passages   Google – 64 passages
      Total   Q    CAE   AS         Q    CAE   AS          Q    CAE   AS
abb   4       3    1     1          3    1     1           4    1     1
des   9       6    4     4          6    4     4           6    4     4
ent   21      15   1     1          16   1     0           17   1     0
hum   64      46   42    34         55   48    36          56   48    31
loc   39      27   24    19         34   27    22          35   29    21
num   63      41   30    25         48   35    25          50   40    22
      200     138  102   84         162  120   88          168  123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1,134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

The set of 1,210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three question sets that were created using TREC questions and the New York Times corpus.

Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

            20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to group equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted, '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'


– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

– 'Queen Elizabeth' vs. 'Elizabeth II'
– 'Mt. Kilimanjaro' vs. 'Mount Kilimanjaro'
– 'River Glomma' vs. 'Glomma'
– 'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'
– 'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer: LogSim is approximately equal to 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity these answers share a relation of inclusion. Examples:

– 'Mt. Kenya' vs. 'Mount Kenya National Park'
– 'Bavaria' vs. 'Overbayern, Upper Bavaria'
– '2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being the more specific answer), but a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same referent: we can find the name in the original language of the entity's location or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

– 'Kotashaan' vs. 'Khotashan'
– 'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above (see the short sketch after this list); the question remains when and where exactly to apply it.

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to also cover acronyms.
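The Soundex check referred to in point 6 is illustrated below. It assumes the Soundex implementation from Apache Commons Codec, but any standard Soundex implementation should produce the same codes: both misspellings collapse to 'K325', even though their Levenshtein distance is above the clustering threshold.

import org.apache.commons.codec.language.Soundex;

// Both spellings map to the same phonetic code, so a Soundex-based comparison
// would let us cluster them.
public class SoundexCheck {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        System.out.println(soundex.soundex("Kotashaan")); // K325
        System.out.println(soundex.soundex("Khotashan")); // K325
    }
}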

Over this section we have looked at several points of improvement. Next, we will describe how we solved each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of a set of rules that will compose our 'relations module', new normalizers and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive the several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also hyponymy and antonymy that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

Once this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers and, from there, use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

Each answer now possesses the set of relations it shares with other answers – namely the type of relation and the answer it is related to – and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.
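The sketch below illustrates how this scoring could look in code. The type and method names (CandidateAnswer, RelationType, the two Scorer implementations) are ours and only mirror the architecture described here and in Figure 5; they are not the actual JustAsk classes.

import java.util.Arrays;
import java.util.List;

enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

class CandidateAnswer {
    final String text;
    final List<RelationType> relations; // relations this answer shares with others
    final int frequency;                // exact repetitions of this answer

    CandidateAnswer(String text, List<RelationType> relations, int frequency) {
        this.text = text;
        this.relations = relations;
        this.frequency = frequency;
    }
}

interface Scorer {
    double score(CandidateAnswer answer);
}

// Baseline Scorer: frequency only.
class BaselineScorer implements Scorer {
    public double score(CandidateAnswer a) { return a.frequency; }
}

// Rules Scorer: 1.0 for each equivalence, 0.5 for each inclusion (either direction).
class RulesScorer implements Scorer {
    public double score(CandidateAnswer a) {
        double s = 0.0;
        for (RelationType r : a.relations) {
            if (r == RelationType.EQUIVALENCE) s += 1.0;
            else if (r == RelationType.INCLUDES || r == RelationType.INCLUDED_BY) s += 0.5;
        }
        return s;
    }
}

class ScorerDemo {
    public static void main(String[] args) {
        CandidateAnswer a = new CandidateAnswer("Lee Myung-bak",
                Arrays.asList(RelationType.EQUIVALENCE, RelationType.INCLUDES), 3);
        System.out.println(new BaselineScorer().score(a)); // 3.0
        System.out.println(new RulesScorer().score(a));    // 1.5
    }
}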


Figure 5: Relations module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, which is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
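A minimal sketch of such a conversion normalizer is given below; the conversion factors are the standard ones, while the class and method names are merely illustrative.

// Converts the measures mentioned above into a single target unit per type,
// so that answers such as '100 km' and '62 miles' become directly comparable.
public class MeasureNormalizer {

    public static double kilometresToMiles(double km)  { return km * 0.621371; }
    public static double metersToFeet(double m)        { return m * 3.28084; }
    public static double kilogramsToPounds(double kg)  { return kg * 2.20462; }
    public static double celsiusToFahrenheit(double c) { return c * 9.0 / 5.0 + 32.0; }

    public static void main(String[] args) {
        // '300 km' (from the Okavango River example) expressed in miles:
        System.out.printf("%.1f miles%n", kilometresToMiles(300)); // ~186.4 miles
    }
}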

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1,134 questions provided us yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not, under any circumstances, underestimate the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, allowing room for further extensions.

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
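A sketch of this proximity boost follows. The 20-character threshold and the 5-point boost are the values stated above; the way positions are located in the passage is a simplification of ours.

import java.util.List;

public class ProximityScorer {

    private static final int DISTANCE_THRESHOLD = 20; // characters
    private static final double BOOST = 5.0;          // score points

    // Returns the (possibly boosted) score of an answer found in a passage.
    public static double boost(String passage, String answer,
                               List<String> keywords, double currentScore) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0) return currentScore;

        double total = 0;
        int found = 0;
        for (String keyword : keywords) {
            int keywordPos = passage.indexOf(keyword);
            if (keywordPos >= 0) {
                total += Math.abs(keywordPos - answerPos);
                found++;
            }
        }
        if (found == 0) return currentScore;

        double averageDistance = total / found;
        return averageDistance < DISTANCE_THRESHOLD ? currentScore + BOOST : currentScore;
    }
}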


4.3 Results

                  200 questions                  1210 questions
Before features   P 50.6%  R 87.0%  A 44.0%      P 28.0%  R 79.5%  A 22.2%
After features    P 51.1%  R 87.0%  A 44.5%      P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1,210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in the results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;


– 'D' for cases of doubt about the relation type.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADITORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
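The core of the annotation loop can be sketched as follows; reading the questions/references file and pairing the references is omitted, and the exact output format of the real script may differ slightly.

import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Asks the user for a judgment on each pair of references and appends one
// line per judged pair to the output file.
public class RelationAnnotator {

    private static final List<String> OPTIONS =
            Arrays.asList("E", "I1", "I2", "A", "C", "D", "skip");

    public static void main(String[] args) throws Exception {
        Scanner in = new Scanner(System.in);
        try (PrintWriter out = new PrintWriter("relations.txt")) {
            String[][] pairs = { { "Lee Myung-bak", "Lee" } }; // stand-in for the real data
            int questionId = 5;
            for (String[] pair : pairs) {
                String judgment;
                do {
                    System.out.printf("QUESTION %d: '%s' vs '%s'? [E/I1/I2/A/C/D/skip] ",
                            questionId, pair[0], pair[1]);
                    judgment = in.nextLine().trim();
                } while (!OPTIONS.contains(judgment));
                if (!judgment.equals("skip")) {
                    out.printf("QUESTION %d %s - %s AND %s%n",
                            questionId, label(judgment), pair[0], pair[1]);
                }
            }
        }
    }

    private static String label(String code) {
        switch (code) {
            case "E":  return "EQUIVALENCE";
            case "I1":
            case "I2": return "INCLUSION";
            case "A":  return "COMPLEMENTARY ANSWERS";
            case "C":  return "CONTRADITORY ANSWERS";
            default:   return "IN DOUBT";
        }
    }
}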

Figure 6: Execution of the application to create relations corpora in the command line.

Over the next chapter we will try to improve the results by working directly with the several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements in common by the total number of distinct elements in the two sequences.

– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets has five elements: {Ke, en, ne, ed, dy}. The result would be (2 · 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven distinct bigrams. (A short worked sketch of this bigram computation is given right after this list of metrics.)

44

– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 · ( m/|s1| + m/|s2| + (m − t)/m )   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 for Levenshtein):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }   (19)

The package contained other metrics of particular interest to test, such as the following two measures, which were also added and evaluated:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even to match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
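Going back to the 'Kennedy' vs. 'Keneddy' example from the Dice entry, the bigram computation can be reproduced with a few lines of code (our own sketch, independent of the SimMetrics package):

import java.util.HashSet;
import java.util.Set;

// Bigram-based Dice and Jaccard, following the 'Kennedy' / 'Keneddy' example.
public class BigramSimilarity {

    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    static double dice(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        return 2.0 * common.size() / (x.size() + y.size());
    }

    static double jaccard(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return (double) common.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));    // 10/12 ≈ 0.83
        System.out.println(jaccard("Kennedy", "Keneddy")); // 5/7  ≈ 0.71
    }
}

As for plugging the package metrics into the clusterer, and assuming the 1.x API of 'uk.ac.shef.wit.simmetrics' (where each metric extends AbstractStringMetric and exposes getSimilarity), swapping metrics amounts to swapping objects:

import uk.ac.shef.wit.simmetrics.similaritymetrics.AbstractStringMetric;
import uk.ac.shef.wit.simmetrics.similaritymetrics.JaroWinkler;
import uk.ac.shef.wit.simmetrics.similaritymetrics.Levenshtein;

// The clusterer only needs a reference to a metric object and a threshold;
// the metric itself is interchangeable.
public class MetricComparison {
    public static void main(String[] args) {
        AbstractStringMetric[] metrics = { new Levenshtein(), new JaroWinkler() };
        for (AbstractStringMetric metric : metrics) {
            float similarity = metric.getSimilarity("Arnold", "Arnie");
            System.out.println(metric.getClass().getSimpleName() + ": " + similarity);
        }
    }
}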


Then, the LogSim metric by Cordeiro and his team, described in the Related Work chapter, was implemented and added to the package, using the same weighting penalty of 2.7 as in that team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually inexistent in most cases.


Metric / Threshold    0.17                          0.33                          0.5
Levenshtein           P 51.1  R 87.0  A 44.5        P 43.7  R 87.5  A 38.5        P 37.3  R 84.5  A 31.5
Overlap               P 37.9  R 87.0  A 33.0        P 33.0  R 87.0  A 33.0        P 36.4  R 86.5  A 31.5
Jaccard index         P 9.8   R 56.0  A 5.5         P 9.9   R 55.5  A 5.5         P 9.9   R 55.5  A 5.5
Dice coefficient      P 9.8   R 56.0  A 5.5         P 9.9   R 56.0  A 5.5         P 9.9   R 55.5  A 5.5
Jaro                  P 11.0  R 63.5  A 7.0         P 11.9  R 63.0  A 7.5         P 9.8   R 56.0  A 5.5
Jaro-Winkler          P 11.0  R 63.5  A 7.0         P 11.9  R 63.0  A 7.5         P 9.8   R 56.0  A 5.5
Needleman-Wunsch      P 43.7  R 87.0  A 38.0        P 43.7  R 87.0  A 38.0        P 16.9  R 59.0  A 10.0
LogSim                P 43.7  R 87.0  A 38.0        P 43.7  R 87.0  A 38.0        P 42.0  R 87.0  A 36.5
Soundex               P 35.9  R 83.5  A 30.0        P 35.9  R 83.5  A 30.0        P 28.7  R 82.0  A 23.5
Smith-Waterman        P 17.3  R 63.5  A 11.0        P 13.4  R 59.5  A 8.0         P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (in %) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are obtained with Levenshtein at the 0.17 threshold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but then adds an extra weight for strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would be a value identical to a match between two tokens for that extra weight.

The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.

A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
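A sketch of the computation defined by equations (20)-(22) is given below. Note that the exact tokenization and the counting convention for nonCommonTerms (here, unmatched tokens from both answers, with first-letter matches counted as pairs) are our reading of the description above, not necessarily the exact implementation.

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class JFKDistance {

    public static double distance(String a1, String a2) {
        Set<String> s1 = new LinkedHashSet<>(Arrays.asList(a1.trim().split("\\s+")));
        Set<String> s2 = new LinkedHashSet<>(Arrays.asList(a2.trim().split("\\s+")));
        int maxLength = Math.max(s1.size(), s2.size());

        // Common terms: exact (case-insensitive) token matches.
        int commonTerms = 0;
        for (String t : s1) if (containsIgnoreCase(s2, t)) commonTerms++;
        double commonTermsScore = (double) commonTerms / maxLength;

        // Terms left unmatched on either side.
        int nonCommonTerms = 0;
        for (String t : s1) if (!containsIgnoreCase(s2, t)) nonCommonTerms++;
        for (String t : s2) if (!containsIgnoreCase(s1, t)) nonCommonTerms++;

        // Pairs of unmatched tokens that share their first letter ('J.' / 'John').
        int firstLetterMatches = 0;
        for (String u : s1) {
            if (containsIgnoreCase(s2, u)) continue;
            for (String v : s2) {
                if (!containsIgnoreCase(s1, v)
                        && Character.toLowerCase(u.charAt(0)) == Character.toLowerCase(v.charAt(0))) {
                    firstLetterMatches++;
                    break;
                }
            }
        }
        double firstLetterMatchesScore = nonCommonTerms == 0
                ? 0.0 : (firstLetterMatches / (double) nonCommonTerms) / 2.0;

        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    private static boolean containsIgnoreCase(Set<String> set, String token) {
        for (String s : set) if (s.equalsIgnoreCase(token)) return true;
        return false;
    }

    public static void main(String[] args) {
        // One common term plus two first-letter matches: roughly 0.42 under this
        // counting convention, far below what Levenshtein gives for the same pair.
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}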


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                   200 questions                  1210 questions
Before             P 51.1%  R 87.0%  A 44.5%      P 28.2%  R 79.8%  A 22.5%
After new metric   P 51.7%  R 87.0%  A 45.0%      P 29.2%  R 79.8%  A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1,210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered us the best results, we would have something like:

if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)

For that effort, we used the best threshold for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there are a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for too much), Table 10 presents the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).

Metric / Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index        7.58     0.00       0.00
Dice                 7.58     0.00       0.00
Jaro                 9.09     0.00       0.00
Jaro-Winkler         9.09     0.00       0.00
Needleman-Wunsch     42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman       9.09     2.56       0.00
JFKDistance          54.55    53.85      4.84

Table 10: Accuracy results (in %) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. The best accuracy per category is achieved by JFKDistance (tied with other metrics for Location and Numeric).

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results are always 'proportional' in each category to how they perform overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single, unchanging answer. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected Google corpus (in this case, 2010).

– Numeric answers fail easily, because when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve a) ignoring 'one' to 'three' when written as words, b) somehow excluding the publishing date from the snippet, and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers, separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error is in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways of improving JustAsk's results in the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4 845-848 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is a hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 1)
  – Example: 'M01 Y2001' includes 'D01 M01 Y2001'

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 2)
  – Example: 'Y2001' includes 'D01 M01 Y2001'


1.2 Questions of category Numeric

Given

• number = (\d+([0-9E])*)

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: '200 people' is equivalent to '200'

• matchesic(Ai, "(.*)\bnumber( .*)") and matchesic(Aj, "(.*)\bnumber( .*)") and (numAimax > numAj) and (numAimin < numAj)
  – Example: '210 people' is equivalent to '200 people'

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)")
  – Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)") and (numAi = numAj)
  – Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number( .*)") and (numAi < numAj)
  – Example: 'more than 200' includes '300'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number( .*)") and (numAi > numAj)
  – Example: 'less than 300' includes '200'

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: 'Mr. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: 'John Fitzgerald Kennedy' is equivalent to 'John F. Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F. Kennedy' is equivalent to 'F. Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: 'Mr. Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)

• levenshteinDistance(Ai, Aj) < 1

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: 'Republic of China' is equivalent to 'China'

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: 'Florida' includes 'Miami, Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: 'California' includes 'Los Angeles, California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  – Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: 'animal' includes 'monkey'
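To make the notation above concrete, here is how the first Location equivalence rule ('Republic of China' is equivalent to 'China') might be translated into code; this is our illustrative rendering, not the actual JustAsk implementation.

import java.util.regex.Pattern;

// Ai is equivalent to Aj if Ai has the form '<republic|city|kingdom|island> of|in Aj'.
public class LocationEquivalenceRule {

    public static boolean equivalent(String ai, String aj) {
        String[] tokensAi = ai.trim().split("\\s+");
        String[] tokensAj = aj.trim().split("\\s+");
        if (tokensAi.length != 3 || tokensAj.length != 1) return false;
        return tokensAi[2].equalsIgnoreCase(tokensAj[0])
                && Pattern.matches("(?i)(republic|city|kingdom|island)", tokensAi[0])
                && Pattern.matches("(?i)(of|in)", tokensAi[1]);
    }

    public static void main(String[] args) {
        System.out.println(equivalent("Republic of China", "China")); // true
    }
}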

Contents

• Introduction
• Related Work
  • Relating Answers
  • Paraphrase Identification
    • SimFinder – A methodology based on linguistic analysis of comparable corpora
    • Methodologies purely based on lexical similarity
    • Canonicalization as complement to lexical features
    • Use of context to extract similar phrases
    • Use of dissimilarities and semantics to provide better similarity measures
  • Overview
• JustAsk architecture and work points
  • JustAsk
    • Question Interpretation
    • Passage Retrieval
    • Answer Extraction
  • Experimental Setup
    • Evaluation measures
    • Multisix corpus
    • TREC corpus
  • Baseline
    • Results with Multisix Corpus
    • Results with TREC Corpus
  • Error Analysis over the Answer Module
• Relating Answers
  • Improving the Answer Extraction module: Relations Module Implementation
  • Additional Features
    • Normalizers
    • Proximity Measure
  • Results
  • Results Analysis
  • Building an Evaluation Corpus
• Clustering Answers
  • Testing Different Similarity Metrics
    • Clusterer Metrics
    • Results
    • Results Analysis
  • New Distance Metric - JFKDistance
    • Description
    • Results
    • Results Analysis
  • Possible new features – Combination of distance metrics to enhance overall results
  • Posterior analysis
• Conclusions and Future Work

List of Figures

1  JustAsk architecture, mirroring the architecture of a general QA system . . . 24
2  How the software to create the 'answers corpus' works . . . 28
3  Execution of the application to create answers corpora in the command line . . . 29
4  Proposed approach to automate the detection of relations between answers . . . 38
5  Relations' module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion . . . 40
6  Execution of the application to create relations' corpora in the command line . . . 43


List of Tables

1   JustAsk baseline best results for the corpus test of 200 questions . . . 32
2   Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages . . . 32
3   Answer extraction intrinsic evaluation, using 64 passages from the search engine Google . . . 33
4   Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category . . . 33
5   Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus . . . 34
6   Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus . . . 34
7   Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC . . . 42
8   Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded . . . 47
9   Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC . . . 50
10  Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result . . . 51


Acronyms

TREC – Text REtrieval Conference
QA – Question Answering
POS – Part of Speech


1 Introduction

QA systems have been with us since the second half of the 20th century. But it was with the World Wide Web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the shot; after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.

This led us to the development of JustAsk [39, 38], a QA system whose mission is to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears more often, in more documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.

In summary, the goals of this work are:

– to survey the state of the art on answer relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– to improve the Answer Extraction module by finding/implementing a proper methodology and tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– to test new clustering measures;

– to create corpora and software tools to evaluate our work.

This document is structured in the following way: Section 2 describes specific related work on the subject of relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the clustering stage and the results achieved; finally, in Section 6 we present the main conclusions and future work.


2 Related Work

As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variations to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then discuss the different techniques to detect paraphrases described in the literature.

2.1 Relating Answers

As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency counts and on the number and type of relationships between the pairs present in the relevant passages.

From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – the answers are consistent and mutually entailing, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter');

Inclusion – the answers are consistent and differ in specificity, one entailing the other. The relation can be one of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers);

Alternative – the answers do not entail each other at all; they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher');

Contradictory – the answers are inconsistent and their conjunction is invalid. For instance, the answers '1,430 km' and '1,609 km' for the question 'How long is the Okavango River?' are contradictory.

The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.

2.2 Paraphrase Identification

According to WordReference.com 1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should not be considered a paraphrase.

Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on referred to as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages we present an overview of some approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.

2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora

We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization by a team led by Vasileios Hatzivassiloglou that allowed clustering highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:

1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5,800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.


– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached;

– Composite Features – combinations of pairs of primitive features to improve clustering: analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.

Already in its first version, the system showed better results than TF-IDF [37], the classic information retrieval method. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of paraphrase identification, namely the introduction of a quantitative evaluation measure.

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many N-grams overlap in two sentences (usually with N ≤ 4); in other words, how many common sequences of words we can find between the two sentences. The following formula gives the Word Simple N-gram Overlap measure:

NGsim(S_1, S_2) = \frac{1}{N} \sum_{n=1}^{N} \frac{Count\_match(n\mbox{-}gram)}{Count(n\mbox{-}gram)} \quad (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
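As a rough illustration of formula (1), the sketch below computes a word simple n-gram overlap for two sentences; whitespace tokenization and the function names are our own simplifying assumptions, not the authors' implementation.

```python
def ngrams(tokens, n):
    """All contiguous word n-grams of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(s1, s2, max_n=4):
    """Word simple n-gram overlap in the spirit of formula (1)."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    total = 0.0
    for n in range(1, max_n + 1):
        in_shorter = min(len(t1), len(t2)) - n + 1   # n-grams in the shorter sentence
        if in_shorter <= 0:
            continue
        matches = len(ngrams(t1, n) & ngrams(t2, n))
        total += matches / in_shorter
    return total / max_n

print(ngram_overlap("the okavango river is over 1000 miles long",
                    "the okavango river flows over 1000 miles"))
```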

A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edit operations required to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function that measures how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision but involved inadequate assumptions about the data, while the heuristic allowed extracting richer but also less parallel data.


In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-Gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk currently possesses, so this is a good starting point to explore other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is

L(x, \lambda) = -\log_2(\lambda / x) \quad (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are way too similar). The complete formula is written below:

logSim(S_1, S_2) = \begin{cases} L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\ e^{-k \cdot L(x, \lambda)} & \text{otherwise} \end{cases} \quad (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e., the number of words of the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.

L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
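A minimal sketch of LogSim as given by formulas (2) and (3) follows. It counts plain common words as links, whereas the original counts 'exclusive' lexical links (e.g. after filtering function words), which we omit here; the constant k is the penalty factor mentioned above.

```python
import math

def logsim(s1, s2, k=2.7):
    """Direct transcription of formulas (2) and (3); links are common words here."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    links = len(set(w1) & set(w2))        # lambda: number of shared words
    x = max(len(w1), len(w2))             # length of the longest sentence
    if links == 0:
        return 0.0
    l = -math.log2(links / x)             # L(x, lambda)
    return l if l < 1.0 else math.exp(-k * l)

print(logsim("Mary had a very nice dinner for her birthday",
             "It was a successful dinner party for Mary on her birthday"))
```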

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1,087 similar sentence pairs, with one sentence being a compressed version of the other.


2.2.3 Canonicalization as a complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves, in particular, number entities (dates, times, monetary values, percentages), passive-to-active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which leaves out a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other, more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive-to-active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question. Two sentences sharing most of their words are likely to be paraphrases. But these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only the 'easy' paraphrases were identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by

sim(S_1, S_2) = \cos(S_1, S_2) - mismatch\_penalty \quad (4)


with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming recurrence – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:

s(S_1, S_2) = \max \begin{cases} s(S_1, S_2 - 1) - skip\_penalty \\ s(S_1 - 1, S_2) - skip\_penalty \\ s(S_1 - 1, S_2 - 1) + sim(S_1, S_2) \\ s(S_1 - 1, S_2 - 2) + sim(S_1, S_2) + sim(S_1, S_2 - 1) \\ s(S_1 - 2, S_2 - 1) + sim(S_1, S_2) + sim(S_1 - 1, S_2) \\ s(S_1 - 2, S_2 - 2) + sim(S_1, S_2 - 1) + sim(S_1 - 1, S_2) \end{cases} \quad (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
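The recurrence in formula (5) can be read as a standard dynamic programming table fill. The sketch below is one possible transcription, assuming a precomputed matrix of sentence-level similarities (formula (4)) and an illustrative skip penalty value.

```python
def align_score(sims, skip_penalty=0.2):
    """One possible transcription of recurrence (5). sims[i][j] holds the local
    similarity between sentence i of the first paragraph and sentence j of the second."""
    n, m = len(sims), len(sims[0])
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    sim = lambda a, b: sims[a - 1][b - 1]          # 1-based helper
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                s[i][j - 1] - skip_penalty,        # skip a sentence of S2
                s[i - 1][j] - skip_penalty,        # skip a sentence of S1
                s[i - 1][j - 1] + sim(i, j),       # one-to-one alignment
            ]
            if j >= 2:                             # one-to-two alignments
                options.append(s[i - 1][j - 2] + sim(i, j) + sim(i, j - 1))
            if i >= 2:
                options.append(s[i - 2][j - 1] + sim(i, j) + sim(i - 1, j))
            if i >= 2 and j >= 2:                  # two-to-two alignment
                options.append(s[i - 2][j - 2] + sim(i, j - 1) + sim(i - 1, j))
            s[i][j] = max(options)
    return s[n][m]

print(align_score([[0.9, 0.1], [0.2, 0.8]]))
```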

The results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we would study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexically based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier, in order to determine whether we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus 4, a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)} \right) \quad (6)

For each word w in the segment T1, the goal is to identify the word in T2 that matches it the most, given by maxSim(w, T2), apply the specificity via IDF, sum these weighted values and normalize by the total IDF of the segment in question. The same process is applied to T2, and in the end we take the average of the two directions. The authors offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22] and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics:

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates the degree of dependence between them;

– Latent Semantic Analysis (LSA) – uses an algebraic technique called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix has passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage;

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy;

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up);

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal';

4 http://www.natcorp.ox.ac.uk

– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content;

– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input;

– Jcn – uses the same information as Lin, but combined in a different way.
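As a rough sketch of how formula (6) combines maxSim with the IDF weighting, regardless of which of the above metrics provides maxSim, consider the following; the word_sim function and the idf table are assumed to be given and are hypothetical names.

```python
def text_similarity(t1, t2, word_sim, idf):
    """Sketch of formula (6). t1, t2: token lists; word_sim(w1, w2) returns a
    similarity in [0, 1]; idf maps a word to its inverse document frequency."""
    def directed(src, dst):
        num = sum(max(word_sim(w, w2) for w2 in dst) * idf.get(w, 1.0) for w in src)
        den = sum(idf.get(w, 1.0) for w in src)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))

# Toy usage with exact-match similarity and uniform IDF:
print(text_similarity("he was shot".split(), "he was murdered".split(),
                      lambda a, b: 1.0 if a == b else 0.0, {}))
```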

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion;

– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource;

– Cosine measure – using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning, and therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with each element equal to 1 if the corresponding word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, \vec{a} and \vec{b}, and the similarity is calculated using the following formula:

sim(\vec{a}, \vec{b}) = \frac{\vec{a} \, W \, \vec{b}}{|\vec{a}| \, |\vec{b}|} \quad (7)


with W being a semantic similarity matrix containing information about the similarity level of each word pair – in other words, each element Wij is a number between 0 and 1 representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than the approaches of either Mihalcea, Qiu or Zhang, and five with a better precision rating than those three as well.
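A small sketch of formula (7) follows, assuming the two sentences have already been mapped onto binary vectors over a shared vocabulary and that a word-to-word similarity matrix W is available; the toy vocabulary and W values below are invented for illustration.

```python
import numpy as np

def fs_similarity(v1, v2, W):
    """Sketch of formula (7): v1 and v2 are binary word-occurrence vectors over a
    shared vocabulary, W a word-to-word semantic similarity matrix in [0, 1]."""
    a, b = np.asarray(v1, dtype=float), np.asarray(v2, dtype=float)
    return float(a @ W @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vocabulary: ['he', 'was', 'shot', 'murdered'] with an invented W
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.8],
              [0.0, 0.0, 0.8, 1.0]])
print(fs_similarity([1, 1, 1, 0], [1, 1, 0, 1], W))   # 'he was shot' vs 'he was murdered'
```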

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences S1 and S2, and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S_1, S_2) = \frac{Sim(S_1, S_2)}{Diss(S_1, S_2)} \quad (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies relative to the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the work we found to have the most potential to be successfully tested on our system.

There are two basic lines of work here: the use of lexically based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting additional features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the paraphrase identification formula.

For all of these techniques, we also saw different corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1,087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora, using a mixture of these two plus negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID 5 that combines rule-based components with machine learning-based ones. Over this chapter we describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about the question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

5 http://www.inesc-id.pt

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answer variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as

Overlap(X, Y) = 1 - \frac{|X \cap Y|}{\min(|X|, |Y|)} \quad (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal to, or contained in, the other. As mentioned in Section 2.2.2, the Levenshtein distance computes the minimum number of edit operations required to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the two clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged if their distance is under a certain threshold. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
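To make the clustering step concrete, here is a minimal sketch of the overlap distance of formula (9) combined with a single-link, threshold-based clustering loop. It mirrors the description above, but it is not the actual JustAsk implementation.

```python
def overlap_distance(x, y):
    """Formula (9): 0.0 when one candidate is equal to, or contained in, the other."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(tx & ty) / min(len(tx), len(ty))

def single_link_cluster(answers, distance, threshold):
    """Repeatedly merge the two closest clusters while their distance stays under the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters

print(single_link_cluster(["Mount Kenya", "Mount Kenya National Park", "Bavaria"],
                          overlap_distance, threshold=0.33))
```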

3.2 Experimental Setup

Over this section we describe the corpora used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = \frac{Correctly\ judged\ items}{Relevant\ (non\mbox{-}null)\ items\ retrieved} \quad (10)

Recall:

Recall = \frac{Relevant\ (non\mbox{-}null)\ items\ retrieved}{Items\ in\ test\ corpus} \quad (11)

F-measure (F1):

F = \frac{2 \times precision \times recall}{precision + recall} \quad (12)

Accuracy:

Accuracy = \frac{Correctly\ judged\ items}{Items\ in\ test\ corpus} \quad (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by

MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i} \quad (14)
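For reference, the measures above can be computed with a few lines of code. The sketch below assumes each test question is annotated as correct, wrong or unanswered and, for MRR, that we know the rank of the first correct answer for each query (None when there is none); the representation is our own assumption.

```python
def evaluate(judgements):
    """Formulas (10)-(13). judgements holds one entry per test question:
    'correct', 'wrong', or None when the system gave no answer."""
    total = len(judgements)
    answered = sum(1 for j in judgements if j is not None)
    correct = sum(1 for j in judgements if j == "correct")
    precision = correct / answered if answered else 0.0
    recall = answered / total
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / total
    return precision, recall, f1, accuracy

def mrr(first_correct_ranks):
    """Formula (14): ranks of the first correct answer per query, None if absent."""
    return sum(1.0 / r for r in first_correct_ranks if r) / len(first_correct_ranks)

print(evaluate(["correct", "wrong", None, "correct"]))   # approx (0.67, 0.75, 0.71, 0.5)
print(mrr([1, 3, None, 2]))                              # approx 0.46
```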

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times editions of 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhausting in itself: the corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1. Question=What is the capital of Somalia?

URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...', and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, compared to the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, by the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line
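A minimal sketch of the interactive loop described above is given below; the file formats, field names and the use of JSON are our own assumptions, not the actual implementation.

```python
import json

def collect_references(questions_file, corpus_file, output_file):
    """Sketch of the interactive reference-collection loop described above."""
    with open(questions_file) as f:
        questions = json.load(f)        # [{'question': ..., 'references': [...]}, ...]
    with open(corpus_file) as f:
        snippets = json.load(f)         # {question text: [content, content, ...]}
    for q in questions:
        for content in snippets.get(q["question"], [])[:64]:   # at most 64 snippets
            print(q["question"])
            print(content)                                     # only 'Content' is shown
            ref = input("reference> ").strip()
            if ref and ref in content:                         # accept only if present
                q["references"].append(ref)
        with open(output_file, "w") as out:                    # status after each question
            json.dump(questions, out)
    return questions
```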


3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1,768 questions, which we filtered to extract those for which we knew the New York Times had a correct answer. 1,134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1,768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1,134 questions, we noticed that many of them are actually interconnected (some even losing their original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us wonder: how much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But much less obvious cases were still present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historical context (which Sundance Festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and therefore was composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1,309 questions. However, due to a length barrier at JustAsk, we reduced this set to 1,210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as they do in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not whole documents but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the corpus test of 200 questions

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance, with a threshold of 0.33, to calculate the distance between candidate answers. For the different amounts of passages (8, 32 or 64) retrieved from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final correct answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.

                     Google – 8 passages              Google – 32 passages
Extracted answers    Questions   CAE       AS         Questions   CAE       AS
                     (#)         success   success    (#)         success   success
0                    21          0         –          19          0         –
1 to 10              73          62        53         16          11        11
11 to 20             32          29        22         13          9         7
21 to 40             11          10        8          61          57        39
41 to 60             1           1         1          30          25        17
61 to 100            0           –         –          18          14        11
> 100                0           –         –          5           4         3
All                  138         102       84         162         120       88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.



                     Google – 64 passages
Extracted answers    Questions   CAE       AS
                     (#)         success   success
0                    17          0         0
1 to 10              18          9         9
11 to 20             2           2         2
21 to 40             16          12        8
41 to 60             42          38        25
61 to 100            38          32        18
> 100                35          30        17
All                  168         123       79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google

            Google – 8 passages     Google – 32 passages    Google – 64 passages
     Total   Q     CAE    AS         Q     CAE    AS          Q     CAE    AS
abb  4       3     1      1          3     1      1           4     1      1
des  9       6     4      4          6     4      4           6     4      4
ent  21      15    1      1          16    1      0           17    1      0
hum  64      46    42     34         55    48     36          56    48     31
loc  39      27    24     19         34    27     22          35    29     21
num  63      41    30     25         48    35     25          50    40     22
     200     138   102    84         162   120    88          168   123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves the best results with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1,134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results. The set of 1,210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus

Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

            20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus

Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there is only one instance of each of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals why:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'


– 'Namibia has built a water canal measuring about 300 km long...' (Siyabona Africa Travel (Pty) Ltd, "Popa Falls – Okavango River – Botswana" webpage)

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.
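A possible shape for this idea is sketched below: the candidate's score gets a small boost when its mention lies within a window of tokens around one of the query keywords. The window size and boost value are illustrative only, not tuned parameters.

```python
def proximity_boost(passage, answer, keywords, window=10, boost=0.5):
    """Sketch: boost a candidate whose mention is within 'window' tokens of a keyword."""
    tokens = [t.lower().strip(".,") for t in passage.split()]
    first = answer.lower().split()[0]
    if first not in tokens:
        return 0.0
    pos = tokens.index(first)
    neighbourhood = tokens[max(0, pos - window): pos + window]
    return boost if any(k.lower() in neighbourhood for k in keywords) else 0.0

passage = "Namibia has built a water canal measuring about 300 km long"
print(proximity_boost(passage, "300 km", ["Okavango", "River"]))   # 0.0 - no boost here
```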

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

lsquoQueen Elizabethrsquo vs lsquoElizabeth IIrsquo

lsquoMt Kilimanjarorsquo vs lsquoMount Kilimanjarorsquo

lsquoRiver Glommarsquo vs rsquoGlommarsquo

lsquoConsejo Regulador de Utiel-Requenarsquo vs lsquoUtil-Requenarsquo vs lsquoRe-quenarsquo

lsquoCalvin Broadusrsquo vs lsquoBroadusrsquo

Solution we might extend a few set of hand-written rules to relate an-swers such as lsquoQueen Elizabethrsquo and lsquoElizabeth IIrsquo or lsquoBill Clintonrsquo andlsquoMr Clintonrsquo normalizing abbreviations such as lsquoMrrsquo lsquoMrsrsquo lsquoQueenrsquoetc Then by applying a new metric that performs similarity on a to-ken level we might be able to cluster these answers If there is at leastone match between two tokens from candidate answers then it is quitelikely that the two mention the same entity specially if both candidatesshare the same number of tokens and the innitials of the remaining tokensmatch To give a simple example consider the answers lsquoJ F Kennedyrsquoand lsquoJohn Fitzgerald Kennedyrsquo A simple Levenshtein distance will de-tect a considerable number of edit operations Should there be a bettermetric for these particular cases LogSim distance takes into account thenumber of links ie common tokens between two answers which shouldgive it a slight edge to Levenshtein It would if we had more than onlyone common link as we have here (lsquoKennedyrsquo) out of three words for eachanswer So we have LogSim approximately equal to 0014 Levenshteinactually gives us a better score (059)With the new proposed metric we detect the common token lsquoKennedyrsquoand then we check the other tokens In this case we have an easy matchwith both answers sharing the same number of words and each wordsharing at least the same innitial in the same order What should be thethreshold here in case we do not have this perfect alignement If in an-other example we have a single match between answers but the innitialsof the other tokens are different should we reject it The answer is likelyto be a lsquoYesrsquo


3. Places/dates/numbers that differ in specificity – some candidates in the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity these answers share a relation of inclusion. Examples:

'Mt. Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Oberbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities are referred to by completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviations normalization to include these acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction Module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

Once this process is automated and the answering module is extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share a same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers and, from then on, use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to, plus a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes and the other for the answer that is included by the other).

For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not change at all for any of the corpora.
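To make the scoring scheme above concrete, here is a minimal sketch of how a scorer could combine these counts, using the weights from the first experiment; the class and field names (Candidate, RelationScorer) are illustrative assumptions and not the actual JustAsk code.

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical candidate answer holding the relation counts detected against other candidates.
    class Candidate {
        String text;
        int frequency;    // exact repetitions of this answer
        int equivalences; // relations of equivalence detected
        int inclusions;   // relations of inclusion (both directions)
        Candidate(String text, int frequency, int equivalences, int inclusions) {
            this.text = text;
            this.frequency = frequency;
            this.equivalences = equivalences;
            this.inclusions = inclusions;
        }
    }

    // Sketch of a scorer that boosts candidates by the relations they share,
    // using the weights tried in the first experiment (1.0 for frequency and
    // equivalence, 0.5 for inclusion).
    class RelationScorer {
        static final double FREQUENCY_WEIGHT = 1.0;
        static final double EQUIVALENCE_WEIGHT = 1.0;
        static final double INCLUSION_WEIGHT = 0.5;

        double score(Candidate c) {
            return FREQUENCY_WEIGHT * c.frequency
                 + EQUIVALENCE_WEIGHT * c.equivalences
                 + INCLUSION_WEIGHT * c.inclusions;
        }

        Candidate best(List<Candidate> candidates) {
            Candidate best = null;
            for (Candidate c : candidates) {
                if (best == null || score(c) > score(best)) {
                    best = c;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            List<Candidate> candidates = Arrays.asList(
                new Candidate("1000 miles", 2, 1, 0),
                new Candidate("300 km", 2, 0, 0));
            System.out.println(new RelationScorer().best(candidates).text); // "1000 miles"
        }
    }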


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, which is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
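As an illustration of the conversion idea, the sketch below converts kilometre distances into miles so that equivalent answers end up in a single canonical form; the class name and regular expression are assumptions made for this example, not the actual JustAsk implementation.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Minimal sketch of a unit normalizer for Numeric:Distance answers, assuming
    // answers arrive as plain strings such as "300 km".
    class DistanceNormalizer {
        private static final Pattern KM =
            Pattern.compile("([0-9]+(?:\\.[0-9]+)?)\\s*(?:km|kilometers?|kilometres?)");
        private static final double KM_TO_MILES = 0.621371;

        // Converts 'X km' into 'Y miles' so that equivalent answers share the same unit.
        static String normalize(String answer) {
            Matcher m = KM.matcher(answer.toLowerCase());
            if (m.find()) {
                double km = Double.parseDouble(m.group(1));
                double miles = km * KM_TO_MILES;
                return String.format("%.0f miles", miles);
            }
            return answer; // already in the canonical unit (or not a distance)
        }

        public static void main(String[] args) {
            System.out.println(normalize("300 km"));     // -> "186 miles"
            System.out.println(normalize("1000 miles")); // unchanged
        }
    }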

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals only to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, so that 'Grozny' and 'GROZNY' are treated as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, allowing room for further extensions.

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
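A minimal sketch of this proximity boost is shown below, assuming the passage is available as a plain string and the answer's character offset in it is known; names and representations are illustrative only.

    import java.util.Arrays;
    import java.util.List;

    // Sketch of the proximity boost: if the average character distance between
    // the answer and the query keywords in the passage is under a threshold,
    // the answer score gets a fixed bonus.
    class ProximityScorer {
        static final int DISTANCE_THRESHOLD = 20; // characters
        static final double BOOST = 5.0;          // points added to the answer score

        static double boost(String passage, int answerOffset, List<String> keywords) {
            int total = 0, found = 0;
            for (String keyword : keywords) {
                int idx = passage.indexOf(keyword);
                if (idx >= 0) {
                    total += Math.abs(idx - answerOffset);
                    found++;
                }
            }
            if (found == 0) return 0.0;
            double averageDistance = (double) total / found;
            return averageDistance < DISTANCE_THRESHOLD ? BOOST : 0.0;
        }

        public static void main(String[] args) {
            String passage = "The Okavango River is about 1000 miles long";
            int answerOffset = passage.indexOf("1000 miles");
            System.out.println(boost(passage, answerOffset,
                Arrays.asList("Okavango", "River", "long"))); // 5.0 (average distance under 20)
        }
    }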


4.3 Results

                        200 questions                   1210 questions
Before features     P: 50.6%  R: 87.0%  A: 44.0%    P: 28.0%  R: 79.5%  A: 22.2%
After features      P: 51.1%  R: 87.0%  A: 44.5%    P: 28.2%  R: 79.8%  A: 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt about the relation type.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.
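The following sketch illustrates the core of such an annotation loop, with a hard-coded toy question instead of the real questions and references file; the labels and output format follow the description above, but the details are simplified assumptions.

    import java.io.PrintWriter;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Scanner;

    // Rough sketch of the console annotation loop used to build the relations corpus:
    // it pairs the references of each question and asks for one of the labels
    // described above (E, I1, I2, A, C, D or skip).
    class RelationAnnotator {
        private static final List<String> LABELS = Arrays.asList("E", "I1", "I2", "A", "C", "D", "skip");

        public static void main(String[] args) throws Exception {
            Scanner in = new Scanner(System.in);
            PrintWriter out = new PrintWriter("relations.txt");
            // toy example: question 5 with two references
            int questionId = 5;
            String[] references = {"Lee Myung-bak", "Lee"};
            for (int i = 0; i < references.length; i++) {
                for (int j = i + 1; j < references.length; j++) {
                    String label;
                    do {
                        System.out.printf("QUESTION %d: %s AND %s ? ", questionId, references[i], references[j]);
                        label = in.nextLine().trim();
                    } while (!LABELS.contains(label)); // repeat until a valid judgment is given
                    if (!label.equals("skip")) {
                        out.printf("QUESTION %d: %s - %s AND %s%n", questionId, label, references[i], references[j]);
                    }
                }
            }
            out.close();
        }
    }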

Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

    J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across both of them.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

    Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets has five elements: Ke, en, ne, ed, dy. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the two sets share five bigrams out of seven distinct ones (a short code sketch of this bigram computation is given right after this list of metrics).


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

    Jaro(s1, s2) = 1/3 · (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

    dw = dj + l · p · (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant cost 1 used by Levenshtein):

    D(i, j) = min { D(i − 1, j − 1) + G,  D(i − 1, j) + G,  D(i, j − 1) + G }    (19)

The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
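As promised in the Dice example above, here is the bigram computation written out as code; this is only an illustrative sketch, not the SimMetrics implementation that was actually plugged into the system.

    import java.util.HashSet;
    import java.util.Set;

    // Character-bigram versions of the Dice and Jaccard measures used in the
    // 'Kennedy' vs 'Keneddy' example.
    class BigramSimilarity {
        static Set<String> bigrams(String s) {
            Set<String> grams = new HashSet<>();
            for (int i = 0; i < s.length() - 1; i++) {
                grams.add(s.substring(i, i + 2));
            }
            return grams;
        }

        static double dice(String a, String b) {
            Set<String> A = bigrams(a), B = bigrams(b);
            Set<String> common = new HashSet<>(A);
            common.retainAll(B);
            return 2.0 * common.size() / (A.size() + B.size());
        }

        static double jaccard(String a, String b) {
            Set<String> A = bigrams(a), B = bigrams(b);
            Set<String> common = new HashSet<>(A);
            common.retainAll(B);
            Set<String> union = new HashSet<>(A);
            union.addAll(B);
            return (double) common.size() / union.size();
        }

        public static void main(String[] args) {
            System.out.println(dice("Kennedy", "Keneddy"));    // 10/12 ≈ 0.83
            System.out.println(jaccard("Kennedy", "Keneddy")); // 5/7  ≈ 0.71
        }
    }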


Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 2.2.2), was implemented and added to the package, using the same penalty weight of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
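For reference, the sketch below shows a simplified, greedy rendering of the single-link criterion with a pluggable metric and threshold; the Distance interface and the class names are assumptions made for illustration, not the actual clusterer used by JustAsk.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    interface Distance {
        double between(String a, String b); // normalized to [0, 1]
    }

    // Greedy approximation of single-link clustering: an answer joins the first
    // cluster that already contains an answer closer than the threshold,
    // otherwise it starts a new cluster.
    class SingleLinkClusterer {
        private final Distance metric;
        private final double threshold;

        SingleLinkClusterer(Distance metric, double threshold) {
            this.metric = metric;
            this.threshold = threshold;
        }

        List<List<String>> cluster(List<String> answers) {
            List<List<String>> clusters = new ArrayList<>();
            for (String answer : answers) {
                List<String> target = null;
                for (List<String> c : clusters) {
                    for (String member : c) {
                        if (metric.between(answer, member) <= threshold) {
                            target = c;
                            break;
                        }
                    }
                    if (target != null) break;
                }
                if (target == null) {
                    target = new ArrayList<>();
                    clusters.add(target);
                }
                target.add(answer);
            }
            return clusters;
        }

        public static void main(String[] args) {
            Distance toyMetric = (a, b) -> a.equalsIgnoreCase(b) ? 0.0 : 1.0; // stand-in for Levenshtein etc.
            SingleLinkClusterer c = new SingleLinkClusterer(toyMetric, 0.17);
            System.out.println(c.cluster(Arrays.asList("Grozny", "GROZNY", "Lisbon")));
            // [[Grozny, GROZNY], [Lisbon]]
        }
    }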

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric \ Threshold        0.17                     0.33                     0.5
Levenshtein        P51.1  R87.0  A44.5      P43.7  R87.5  A38.5      P37.3  R84.5  A31.5
Overlap            P37.9  R87.0  A33.0      P33.0  R87.0  A33.0      P36.4  R86.5  A31.5
Jaccard index      P9.8   R56.0  A5.5       P9.9   R55.5  A5.5       P9.9   R55.5  A5.5
Dice Coefficient   P9.8   R56.0  A5.5       P9.9   R56.0  A5.5       P9.9   R55.5  A5.5
Jaro               P11.0  R63.5  A7.0       P11.9  R63.0  A7.5       P9.8   R56.0  A5.5
Jaro-Winkler       P11.0  R63.5  A7.0       P11.9  R63.0  A7.5       P9.8   R56.0  A5.5
Needleman-Wunsch   P43.7  R87.0  A38.0      P43.7  R87.0  A38.0      P16.9  R59.0  A10.0
LogSim             P43.7  R87.0  A38.0      P43.7  R87.0  A38.0      P42.0  R87.0  A36.5
Soundex            P35.9  R83.5  A30.0      P35.9  R83.5  A30.0      P28.7  R82.0  A23.5
Smith-Waterman     P17.3  R63.5  A11.0      P13.4  R59.5  A8.0       P10.4  R57.5  A6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold (Levenshtein at threshold 0.17).

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it adds an extra weight when, besides at least one perfect match, there is another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers. Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would be a value identical to a match between two tokens for that extra weight.

The new distance is computed in the following way:

    distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

    commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

    firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.

A few flaws of this measure were identified later on, namely regarding its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that separating measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
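The following sketch implements equations (20) to (22) as described; it assumes whitespace tokenization, that length() counts tokens, and that nonCommonTerms is fixed before the first-letter matching step, all of which are interpretation choices rather than confirmed details of the actual implementation.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    class JFKDistance {
        static double distance(String a1, String a2) {
            List<String> tokens1 = Arrays.asList(a1.split("\\s+"));
            List<String> tokens2 = new ArrayList<>(Arrays.asList(a2.split("\\s+")));
            int maxLength = Math.max(tokens1.size(), tokens2.size());

            // common terms: exact (case-insensitive) token matches
            int commonTerms = 0;
            List<String> rest1 = new ArrayList<>();
            for (String tok : tokens1) {
                int idx = indexOfIgnoreCase(tokens2, tok);
                if (idx >= 0) { commonTerms++; tokens2.remove(idx); } else { rest1.add(tok); }
            }
            List<String> rest2 = tokens2; // tokens of a2 left unmatched
            int nonCommonTerms = Math.max(1, rest1.size() + rest2.size());

            // first-letter matches among the unmatched tokens (e.g. 'J.' vs 'John')
            int firstLetterMatches = 0;
            for (String tok : rest1) {
                for (int i = 0; i < rest2.size(); i++) {
                    if (Character.toLowerCase(tok.charAt(0)) == Character.toLowerCase(rest2.get(i).charAt(0))) {
                        firstLetterMatches++;
                        rest2.remove(i);
                        break;
                    }
                }
            }

            double commonTermsScore = (double) commonTerms / maxLength;                           // eq. (21)
            double firstLetterMatchesScore = ((double) firstLetterMatches / nonCommonTerms) / 2.0; // eq. (22)
            return 1.0 - commonTermsScore - firstLetterMatchesScore;                               // eq. (20)
        }

        private static int indexOfIgnoreCase(List<String> tokens, String tok) {
            for (int i = 0; i < tokens.size(); i++) {
                if (tokens.get(i).equalsIgnoreCase(tok)) return i;
            }
            return -1;
        }

        public static void main(String[] args) {
            // 'J. F. Kennedy' vs 'John Fitzgerald Kennedy': one common term ('Kennedy')
            // plus two first-letter matches ('J.'/'John' and 'F.'/'Fitzgerald')
            System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
        }
    }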


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                        200 questions                   1210 questions
Before new metric   P: 51.1%  R: 87.0%  A: 44.5%    P: 28.2%  R: 79.8%  A: 22.5%
After new metric    P: 51.7%  R: 87.0%  A: 45.0%    P: 29.2%  R: 79.8%  A: 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible New Features – Combination of Distance Metrics to Enhance Overall Results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered us the best results, we would have something like:

if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for), Table 10 shows the results for every metric incorporated, for the three main categories that can be easily evaluated – Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).

Metric \ Category    Human     Location    Numeric
Levenshtein          50.00%     53.85%      4.84%
Overlap              42.42%     41.03%      4.84%
Jaccard index         7.58%      0.00%      0.00%
Dice                  7.58%      0.00%      0.00%
Jaro                  9.09%      0.00%      0.00%
Jaro-Winkler          9.09%      0.00%      0.00%
Needleman-Wunsch     42.42%     46.15%      4.84%
LogSim               42.42%     46.15%      4.84%
Soundex              42.42%     46.15%      3.23%
Smith-Waterman        9.09%      2.56%      0.00%
JFKDistance          54.55%     53.85%      4.84%

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except where all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, i.e., results are always 'proportional' in each category to how they perform overall. That stops us from picking the best metric for each category, since we found that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior Analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single, immutable answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are error-prone due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error is in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After inserting such missing references, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art review on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 may help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1,261,167 (1918), 1,435,663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is a hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case-insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case-insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001



1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "()\bnumber( )") and matchesic(Aj, "()\bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bmorethan number( )") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\blessthan number( )") and (numAi > numAj)
  – Example: less than 300 includes 200
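A small illustration of the numeric equivalence band defined by min and max above (two values are considered equivalent when one lies within roughly 10% of the other); this is a sketch of the check, not the rule engine itself.

    class NumericEquivalence {
        static final double MAX = 1.1;
        static final double MIN = 0.9;

        // Equivalence band from the rule above: numAi*MAX > numAj and numAi*MIN < numAj.
        static boolean equivalent(double numAi, double numAj) {
            return numAi * MAX > numAj && numAi * MIN < numAj;
        }

        public static void main(String[] args) {
            System.out.println(equivalent(210, 200)); // true: '210 people' ~ '200 people'
            System.out.println(equivalent(300, 200)); // false
        }
    }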

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr. Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to F. Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr. Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)

• levenshteinDistance(Ai, Aj) < 1
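To illustrate how rules like these translate into code, the sketch below expresses two of the Human equivalence rules above; it assumes whitespace tokenization and is not the actual rule implementation.

    class HumanEquivalenceRules {
        static boolean equivalent(String a1, String a2) {
            String[] ai = a1.split("\\s+");
            String[] aj = a2.split("\\s+");
            // rule: 3 tokens vs 2 tokens, first and last tokens equal
            if (ai.length == 3 && aj.length == 2
                    && ai[0].equalsIgnoreCase(aj[0]) && ai[2].equalsIgnoreCase(aj[1])) {
                return true;
            }
            // rule: 3 tokens vs 3 tokens, first and last tokens equal, middle initials equal
            if (ai.length == 3 && aj.length == 3
                    && ai[0].equalsIgnoreCase(aj[0]) && ai[2].equalsIgnoreCase(aj[2])
                    && Character.toLowerCase(ai[1].charAt(0)) == Character.toLowerCase(aj[1].charAt(0))) {
                return true;
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(equivalent("John F. Kennedy", "John Kennedy"));            // true
            System.out.println(equivalent("John Fitzgerald Kennedy", "John F. Kennedy")); // true
            System.out.println(equivalent("John Kennedy", "Bill Clinton"));               // false
        }
    }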

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey

• Introduction
• Related Work
  • Relating Answers
  • Paraphrase Identification
    • SimFinder – A methodology based on linguistic analysis of comparable corpora
    • Methodologies purely based on lexical similarity
    • Canonicalization as complement to lexical features
    • Use of context to extract similar phrases
    • Use of dissimilarities and semantics to provide better similarity measures
  • Overview
• JustAsk architecture and work points
  • JustAsk
    • Question Interpretation
    • Passage Retrieval
    • Answer Extraction
  • Experimental Setup
    • Evaluation measures
    • Multisix corpus
    • TREC corpus
  • Baseline
    • Results with Multisix Corpus
    • Results with TREC Corpus
  • Error Analysis over the Answer Module
• Relating Answers
  • Improving the Answer Extraction module: Relations Module Implementation
  • Additional Features
    • Normalizers
    • Proximity Measure
  • Results
  • Results Analysis
  • Building an Evaluation Corpus
• Clustering Answers
  • Testing Different Similarity Metrics
    • Clusterer Metrics
    • Results
    • Results Analysis
  • New Distance Metric – JFKDistance
    • Description
    • Results
    • Results Analysis
  • Possible new features – Combination of distance metrics to enhance overall results
  • Posterior analysis
• Conclusions and Future Work

List of Tables

1   JustAsk baseline best results for the corpus test of 200 questions . . . 32
2   Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages . . . 32
3   Answer extraction intrinsic evaluation, using 64 passages from the search engine Google . . . 33
4   Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category . . . 33
5   Baseline results for the three sets of corpus that were created using TREC questions and New York Times corpus . . . 34
6   Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus . . . 34
7   Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC . . . 42
8   Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded . . . 47
9   Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC . . . 50
10  Accuracy results for every similarity metric added for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result . . . 51

Acronyms

TREC – Text REtrieval Conference
QA – Question-Answering
POS – Part of Speech

1 Introduction

QA systems have been with us since the second half of the 20th century. But it was with the World Wide Web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the effort – after all, most of the time we just want a specific, single 'correct' answer instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.

This led us to the development of JustAsk [39, 38], a QA system with the mission of retrieving (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears most often in the most documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.

In summary, the goals of this work are:

– to perform a state of the art review on answer relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– to improve the Answer Extraction module by finding/implementing a proper methodology and tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– to test new clustering measures;

– to create corpora and software tools to evaluate our work.

This document is structured in the following way: Section 2 describes specific related work on relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the clustering stage and the results achieved; and finally, in Section 6, we present the main conclusions and future work.


2 Related Work

As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variations to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences if we stretch the concept enough), we then discuss the different techniques to detect paraphrases described in the literature.

2.1 Relating Answers

As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on the frequency count and on the number and type of relationships between the pairs present in the relevant passages.

From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '23 October 1993' and 'October 23rd, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').

Inclusion – if answers are consistent but differ in specificity, one entailing the other. They can be relations of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – if answers do not entail each other at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.

The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for good clustering in JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.

2.2 Paraphrase Identification

According to WordReference.com 1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea in a symmetric relationship, each entailing the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.

Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on referred to as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages we present an overview of some of the approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.

2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora

We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization by a team led by Vasileios Hatzivassiloglou, which allowed the clustering of highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:

1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether each pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.


– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words but also words that match on their stem – the most basic form of a word, from which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.

Already in its first version, the system showed better results than TF-IDF [37], the classic information retrieval method. Some improvements were made later on by a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many N-grams overlap in two sentences (usually with N ≤ 4) – in other words, how many common sequences of words we can find between the two sentences. The following formula gives the measure of the Word Simple N-gram Overlap:

NGsim(S_1, S_2) = \frac{1}{N} \sum_{n=1}^{N} \frac{Count\_match(n\text{-}gram)}{Count(n\text{-}gram)} \quad (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
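A minimal sketch of formula (1) in Python; whitespace tokenization, lowercasing and N = 4 are our own assumptions, not part of the original formulation.

def ngrams(tokens, n):
    # All contiguous word sequences of length n in the sentence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap_sim(s1, s2, max_n=4):
    """Word Simple N-gram Overlap (formula 1): averages, over n = 1..N, the
    fraction of the shorter sentence's n-grams that also occur in the other."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    total = 0.0
    for n in range(1, max_n + 1):
        g1, g2 = set(ngrams(t1, n)), set(ngrams(t2, n))
        count_shorter = min(len(t1), len(t2)) - n + 1  # n-grams in the shorter sentence
        if count_shorter <= 0:
            continue  # sentences shorter than n words contribute nothing
        total += len(g1 & g2) / count_shorter
    return total / max_n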

A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric computes the minimum number of edit operations needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function that measures how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses. A string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.


In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in the two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

L(x, \lambda) = -\log_2(\lambda / x) \quad (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are too similar). The complete formula is written below:

logSim(S_1, S_2) = \begin{cases} L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\ e^{-k \cdot L(x, \lambda)} & \text{otherwise} \end{cases} \quad (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric that depends only on x and not on y, i.e., the number of words of the shortest sentence, is not a perfect measure – after all, the two are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.
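The basic two-parameter LogSim above can be sketched as follows; whitespace tokenization, lowercasing and the guard for pairs with no common words are our own assumptions.

import math

def logsim(s1, s2, k=2.7):
    """Basic LogSim (formulas 2 and 3), counting exclusive lexical links."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    links = len(set(t1) & set(t2))   # lambda: common words between S1 and S2
    x = max(len(t1), len(t2))        # number of words of the longest sentence
    if links == 0:
        return 0.0                   # no links: treat as maximally dissimilar (assumption)
    l = -math.log2(links / x)        # L(x, lambda), formula (2)
    # First branch: L itself (low for near-identical pairs, per the penalty above);
    # second branch: exponential penalty for pairs with great dissimilarities.
    return l if l < 1.0 else math.exp(-k * l)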

L(x, y, λ) is computed using weighting factors α and β (β being 1−α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.


2.2.3 Canonicalization as a complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution, converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other, more 'sophisticated', methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kinds of equivalences involving Numeric entities that can be considered answers for certain types of questions are already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent point: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information while sharing little lexical resemblance, and for those the usual lexical similarity measures mentioned above are not enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S_1, S_2) = \cos(S_1, S_2) - mismatch\_penalty \quad (4)


with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two paragraphs is given by the following dynamic programming recurrence – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:

s(S_1, S_2) = \max \begin{cases}
s(S_1, S_2 - 1) - skip\_penalty \\
s(S_1 - 1, S_2) - skip\_penalty \\
s(S_1 - 1, S_2 - 1) + sim(S_1, S_2) \\
s(S_1 - 1, S_2 - 2) + sim(S_1, S_2) + sim(S_1, S_2 - 1) \\
s(S_1 - 2, S_2 - 1) + sim(S_1, S_2) + sim(S_1 - 1, S_2) \\
s(S_1 - 2, S_2 - 2) + sim(S_1, S_2 - 1) + sim(S_1 - 1, S_2)
\end{cases} \quad (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
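The recurrence in formula (5) maps directly onto a dynamic programming table over the two lists of sentences. The sketch below is an illustration under our own assumptions (a bag-of-words cosine for sim and arbitrary penalty values), not the authors' implementation.

import math

def cosine(s1, s2):
    # Simple bag-of-words cosine similarity between two sentences.
    w1, w2 = s1.lower().split(), s2.lower().split()
    dot = sum(w1.count(w) * w2.count(w) for w in set(w1) & set(w2))
    norm = math.sqrt(sum(w1.count(w) ** 2 for w in set(w1))) * \
           math.sqrt(sum(w2.count(w) ** 2 for w in set(w2)))
    return dot / norm if norm else 0.0

def local_sim(s1, s2, mismatch_penalty=0.1):
    # Formula (4): cosine minus a mismatch penalty.
    return cosine(s1, s2) - mismatch_penalty

def align(par1, par2, skip_penalty=0.2):
    """Optimal alignment weight between two paragraphs (lists of sentences),
    following the recurrence in formula (5)."""
    n, m = len(par1), len(par2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim_ij = local_sim(par1[i - 1], par2[j - 1])
            candidates = [
                s[i][j - 1] - skip_penalty,        # skip a sentence of the second paragraph
                s[i - 1][j] - skip_penalty,        # skip a sentence of the first paragraph
                s[i - 1][j - 1] + sim_ij,          # one-to-one match
            ]
            if j >= 2:                             # one-to-two match
                candidates.append(s[i - 1][j - 2] + sim_ij + local_sim(par1[i - 1], par2[j - 2]))
            if i >= 2:                             # two-to-one match
                candidates.append(s[i - 2][j - 1] + sim_ij + local_sim(par1[i - 2], par2[j - 1]))
            if i >= 2 and j >= 2:                  # two-to-two crossed match
                candidates.append(s[i - 2][j - 2] + local_sim(par1[i - 1], par2[j - 2]) +
                                  local_sim(par1[i - 2], par2[j - 1]))
            s[i][j] = max(candidates)
    return s[n][m]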

Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we could study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexically based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phase framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, in order to determine whether we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus 4, a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)} \right) \quad (6)

For each word w in the segment T1, the goal is to identify the word that matches it the most in T2, given by maxSim(w, T2), weight it by its specificity via IDF, sum these contributions, and normalize the sum by the total IDF of the segment in question. The same process is applied to T2 and, in the end, the two directions (T1 to T2 and vice versa) are averaged. The authors offer eight different similarity measures to compute maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu and Palmer – wup [46], Resnik [34], Lin [22], and Jiang and Conrath – jcn [14]). Below we offer a brief description of each of these metrics.

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates the degree of dependence between them;

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage;

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related to them through relations of hypernymy and meronymy;

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up);

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS is 'mammal';

4 http://www.natcorp.ox.ac.uk/

– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently has a low Information Content, while a concept that is quite rare has a high Information Content;

– Lin – takes the Resnik measure and normalizes it by using the Information Content of the two nodes received as input;

– Jcn – uses the same information as Lin, but combined in a different way.
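A minimal sketch of the aggregation in formula (6); the word-to-word similarity function (one of the eight metrics above) and the IDF table are assumed to be supplied by the caller and are placeholders here.

def text_similarity(t1, t2, word_sim, idf):
    """Formula (6): IDF-weighted average of the best word-to-word matches,
    computed in both directions and then averaged. `word_sim(w1, w2)` and
    `idf[w]` are assumed to be provided externally."""
    def directed(src, dst):
        num = sum(max((word_sim(w, w2) for w2 in dst), default=0.0) * idf.get(w, 1.0)
                  for w in src)
        den = sum(idf.get(w, 1.0) for w in src)
        return num / den if den else 0.0

    tokens1, tokens2 = t1.lower().split(), t2.lower().split()
    return 0.5 * (directed(tokens1, tokens2) + directed(tokens2, tokens1))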

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and the verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion;

– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource;

– Cosine measure – using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning. They therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with each element equal to 1 if the corresponding word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, \vec{a} and \vec{b}, and the similarity is calculated using the following formula:

sim(\vec{a}, \vec{b}) = \frac{\vec{a} \, W \, \vec{b}}{|\vec{a}| \, |\vec{b}|} \quad (7)


with W being a semantic similarity matrix containing information about the similarity level of each word pair – in other words, each Wij is a number between 0 and 1 representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's approaches, and five with a better precision rating than those three as well.
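Formula (7) can be sketched as follows, assuming a word-pair similarity function (for instance one of the WordNet metrics above) to fill W; this is an illustration, not Fernando and Stevenson's implementation.

import math

def matrix_similarity(s1, s2, word_sim):
    """Formula (7): sim(a, b) = (a . W . b) / (|a| |b|), where a and b are binary
    word vectors over the joint vocabulary and W[i][j] = word_sim(w_i, w_j)."""
    words1, words2 = set(s1.lower().split()), set(s2.lower().split())
    vocab = sorted(words1 | words2)
    a = [1.0 if w in words1 else 0.0 for w in vocab]
    b = [1.0 if w in words2 else 0.0 for w in vocab]
    # a . W . b; word_sim is expected to return 1.0 for identical words.
    awb = sum(a[i] * word_sim(vocab[i], vocab[j]) * b[j]
              for i in range(len(vocab)) for j in range(len(vocab)))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return awb / norm if norm else 0.0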

More recently, Lintean and Rus [23] identified paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S_1, S_2) = Sim(S_1, S_2) / Diss(S_1, S_2) \quad (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exact synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting considers two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and therefore the two sentences are not paraphrases. Lintean and Rus decided to add a constant value (14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here represents what we found to have the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the paraphrase identification formula.

For all of these techniques we also saw different corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu Corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of the two plus negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used in the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis is done on JustAsk [39], a QA system developed at INESC-ID 5 that combines rule-based components with machine learning-based ones. Over this chapter we describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage the idea is to gather important information about the question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

5 http://www.inesc-id.pt/

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of that article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric-type questions), named entities (being able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the variation among answers, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 - \frac{|X \cap Y|}{\min(|X|, |Y|)} \quad (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned before, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any two members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
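A minimal sketch of the overlap distance (formula 9) and of the single-link clustering step just described; the threshold value and the uniform 1.0 candidate score follow the description above, but the code itself is illustrative, not JustAsk's actual implementation.

def overlap_distance(x, y):
    """Formula (9): 0.0 when one candidate is equal to or contained in the other."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(tx & ty) / min(len(tx), len(ty))

def single_link_clustering(answers, distance, threshold=0.33):
    """Merge the two closest clusters while their single-link distance
    (minimum distance between any two members) is under the threshold."""
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None or best[0] > threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    # Representative: the cluster's longest answer; score: number of members,
    # since each candidate scores 1.0.
    return [(max(c, key=len), float(len(c))) for c in clusters]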

3.2 Experimental Setup

In this section we describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = \frac{\text{Correctly judged items}}{\text{Relevant (non-null) items retrieved}} \quad (10)

Recall:

Recall = \frac{\text{Relevant (non-null) items retrieved}}{\text{Items in test corpus}} \quad (11)

F-measure (F1):

F = \frac{2 \times precision \times recall}{precision + recall} \quad (12)

Accuracy:

Accuracy = \frac{\text{Correctly judged items}}{\text{Items in test corpus}} \quad (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i} \quad (14)
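For reference, measures (10) to (14) can be computed as in the sketch below; the variable names (correct, answered, total, ranks) are our own, not part of the evaluation scripts used in this work.

def precision(correct, answered):
    # (10) correctly judged items over relevant (non-null) items retrieved
    return correct / answered if answered else 0.0

def recall(answered, total):
    # (11) relevant (non-null) items retrieved over items in the test corpus
    return answered / total if total else 0.0

def f1(p, r):
    # (12) harmonic mean of precision and recall
    return 2 * p * r / (p + r) if (p + r) else 0.0

def accuracy(correct, total):
    # (13) correctly judged items over items in the test corpus
    return correct / total if total else 0.0

def mrr(ranks):
    # (14) ranks holds, per question, the rank of the first correct answer (or None)
    return sum(1.0 / r for r in ranks if r) / len(ranks) if ranks else 0.0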

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times of 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who the vice-president of a company that no longer exists is, or how old a person that has already died is). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions of these types were gathered from the Multieight corpus [25] and used.

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:

1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we could write any word from the content that has no relation whatsoever to the actual answer. But it helps finding and writing down correct answers, compared with the completely manual approach used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.


3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions, from which we filtered out the questions for which we knew the New York Times had a correct answer; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created that uses the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times retrieved a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us question: how much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?

The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after noticing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.

New York Times corpus

The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, and in as many variations, as in Google.

We also questioned whether the big problem was within the passage retrieval phase, since we were now returning a whole document instead of a single paragraph or sentence, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph containing the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, of the 200 questions, JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For different amounts of passages retrieved from Google (8, 32 or 64 passages), we show how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer is extracted, while the success of the AS stage is given by the number of questions for which one correct final answer is selected from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.

Extracted        Google – 8 passages                     Google – 32 passages
Answers (#)    Questions  CAE success  AS success      Questions  CAE success  AS success
0                  21          0           –               19          0           –
1 to 10            73         62          53               16         11          11
11 to 20           32         29          22               13          9           7
21 to 40           11         10           8               61         57          39
41 to 60            1          1           1               30         25          17
61 to 100           0          –           –               18         14          11
> 100               0          –           –                5          4           3
All               138        102          84              162        120          88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 extracted answers. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

Extracted        Google – 64 passages
Answers (#)    Questions  CAE success  AS success
0                  17          0           0
1 to 10            18          9           9
11 to 20            2          2           2
21 to 40           16         12           8
41 to 60           42         38          25
61 to 100          38         32          18
> 100              35         30          17
All               168        123          79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

          Google – 8 passages     Google – 32 passages    Google – 64 passages
      Total    Q   CAE   AS         Q   CAE   AS            Q   CAE   AS
abb      4     3    1     1         3    1     1            4    1     1
des      9     6    4     4         6    4     4            6    4     4
ent     21    15    1     1        16    1     0           17    1     0
hum     64    46   42    34        55   48    36           56   48    31
loc     39    27   24    19        34   27    22           35   29    21
num     63    41   30    25        48   35    25           50   40    22
All    200   138  102    84       162  120    88          168  123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results. The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

             1134 questions   232 questions   1210 questions
Precision        26.7%            28.2%            27.4%
Recall           77.2%            67.2%            76.9%
Accuracy         20.6%            19.0%            21.1%

Table 5: Baseline results for the three question sets that were created using TREC questions and the New York Times corpus.

Later, we made further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

             20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted, '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there is only one instance of each of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers shows why:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'


– 'Namibia has built a water canal measuring about 300 km long ... Siyabona Africa Travel (Pty) Ltd "Popa Falls - Okavango River - Botswana" webpage ...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer, within a certain predefined threshold, to the keywords used in the search.

2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words for each answer: LogSim is approximately equal to 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order (a sketch of this idea is shown at the end of this section). What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'Yes'.


3. Places/dates/numbers that differ in specificity – some candidates in the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Oberbayern (Upper Bavaria)'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the line for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now within JustAsk, hoping for similar improvements.

4. Language issues – there are often completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to also cover acronyms.
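Below is a sketch of the token-level matching idea proposed in point 2 (normalize honorifics, then accept a pair that shares at least one full token, has the same token count and matching initials elsewhere); it illustrates our proposal and is not code already present in JustAsk.

HONORIFICS = {'mr', 'mrs', 'ms', 'dr', 'sir', 'dame', 'king', 'queen'}

def name_tokens(answer):
    # Lowercase, strip punctuation and drop leading honorifics.
    tokens = [t.strip('.,') for t in answer.lower().split()]
    return [t for t in tokens if t and t not in HONORIFICS]

def same_entity(a1, a2):
    """'J. F. Kennedy' vs 'John Fitzgerald Kennedy': at least one full token in
    common, same token count, and matching initials for the remaining tokens."""
    t1, t2 = name_tokens(a1), name_tokens(a2)
    if not set(t1) & set(t2):
        return False      # no full token in common
    if len(t1) != len(t2):
        return False      # e.g. 'Mt Kenya' vs 'Mount Kenya National Park' is left to inclusion rules
    return all(a == b or a[0] == b[0] for a, b in zip(t1, t2))

# same_entity('J. F. Kennedy', 'John Fitzgerald Kennedy') -> True
# same_entity('George Bush', 'George W. Bush')            -> False (different token count)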

Over this section we have looked at several points of improvement. Next, we describe how we addressed each of these points.


4 Relating Answers

Over this section we try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system, in order to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module contains the metrics already available (Levenshtein and word overlap), plus new metrics drawn from the ideas collected in the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which should allow a better clustering of answers.

The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After this process is automated and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers and, from there, use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

Each answer now possesses the set of relations it shares with other answers – namely the type of relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes the other, and one for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not differ at all, for all the corpora.
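A simplified sketch of how the frequency baseline and the relation rules can be combined into an answer score; the class and attribute names are illustrative, not the actual JustAsk interfaces.

EQUIVALENCE, INCLUDES, INCLUDED_BY = 'equivalence', 'includes', 'included_by'

# Scores used in the first experiment: 1.0 for frequency and equivalence,
# 0.5 for each relation of inclusion.
RELATION_WEIGHTS = {EQUIVALENCE: 1.0, INCLUDES: 0.5, INCLUDED_BY: 0.5}

class Answer:
    def __init__(self, text):
        self.text = text
        self.relations = []   # list of (relation_type, other_answer) pairs
        self.score = 0.0

def score_answers(answers):
    """Baseline scorer (frequency) plus rules scorer (equivalence/inclusion)."""
    for a in answers:
        frequency = sum(1.0 for b in answers if b.text == a.text)
        relation_boost = sum(RELATION_WEIGHTS.get(rel, 0.0) for rel, _ in a.relations)
        a.score = frequency + relation_boost
    return answers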


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: a Baseline Scorer, to count the frequency, and a Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created for answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and degrees Celsius to Fahrenheit, respectively.
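A minimal sketch of such a conversion normalizer, assuming the answer has already been split into a numeric value and a unit string; the function name and the accepted unit spellings are our own.

def normalize_measure(value, unit):
    """Convert metric answers to the imperial counterparts used for comparison:
    km -> miles, m -> feet, kg -> pounds, Celsius -> Fahrenheit."""
    unit = unit.lower()
    if unit in ('km', 'kilometer', 'kilometers', 'kilometre', 'kilometres'):
        return value * 0.621371, 'miles'
    if unit in ('m', 'meter', 'meters', 'metre', 'metres'):
        return value * 3.28084, 'feet'
    if unit in ('kg', 'kilogram', 'kilograms'):
        return value * 2.20462, 'pounds'
    if unit in ('c', 'celsius'):
        return value * 9.0 / 5.0 + 32.0, 'fahrenheit'
    return value, unit  # already in the target unit (or unknown)

# normalize_measure(1609, 'km')[0] is approximately 1000 (miles), which would let
# the '300 km' / '1000 miles' candidates of Section 3.4 be compared on the same scale.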

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, leaving room for further extensions.

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
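As a rough illustration of this heuristic, the sketch below computes the average character distance between an answer occurrence and the query keywords inside a passage and applies the boost. The constants (20 and 5) are the ones stated above; everything else (method names, how positions are obtained) is an assumption.

```java
import java.util.List;

// Sketch of the proximity-based boost described above (illustrative only).
public class ProximityBooster {

    private static final int DISTANCE_THRESHOLD = 20; // characters
    private static final double BOOST = 5.0;          // score points

    // Average absolute character distance between the answer position
    // and each keyword position found in the passage.
    static double averageDistance(int answerPos, List<Integer> keywordPositions) {
        if (keywordPositions.isEmpty()) {
            return Double.MAX_VALUE;
        }
        double sum = 0.0;
        for (int pos : keywordPositions) {
            sum += Math.abs(answerPos - pos);
        }
        return sum / keywordPositions.size();
    }

    static double boostIfClose(double score, int answerPos, List<Integer> keywordPositions) {
        return averageDistance(answerPos, keywordPositions) < DISTANCE_THRESHOLD
                ? score + BOOST
                : score;
    }
}
```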


4.3 Results

                      200 questions                 1210 questions
Before features       P: 50.6  R: 87.0  A: 44.0     P: 28.0  R: 79.5  A: 22.2
After features        P: 51.1  R: 87.0  A: 44.5     P: 28.2  R: 79.8  A: 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As it happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between pairs;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.

If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
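A bare-bones version of such an annotation loop could look like the sketch below; the file handling is omitted and all names are illustrative, so this only conveys the interaction flow, not the actual tool.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Illustrative sketch of the relation-annotation loop (not the real tool).
public class RelationAnnotator {

    private static final List<String> OPTIONS =
            Arrays.asList("E", "I1", "I2", "A", "C", "D", "skip");

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        // In the real application these would come from the questions/references file.
        String[][] referencePairs = {
                {"Lee Myung-bak", "Lee"},
                {"200 people", "more than 150 people"}
        };
        for (int questionId = 0; questionId < referencePairs.length; questionId++) {
            String[] pair = referencePairs[questionId];
            String judgment;
            do {
                System.out.printf("QUESTION %d: '%s' vs '%s' [E/I1/I2/A/C/D/skip]: %n",
                        questionId + 1, pair[0], pair[1]);
                judgment = in.nextLine().trim();
            } while (!OPTIONS.contains(judgment)); // repeat until a valid option is given
            if (!"skip".equals(judgment)) {
                System.out.printf("QUESTION %d %s - %s AND %s%n",
                        questionId + 1, judgment, pair[0], pair[1]);
            }
        }
    }
}
```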

Figure 6: Execution of the application to create the relations corpora in the command line.

Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}   (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements in their union.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = \frac{2|A \cap B|}{|A| + |B|}   (16)

Dice's Coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five elements: Ke, en, ne, ed, dy. The result would therefore be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven elements. (A small program reproducing this computation is given right after this list of metrics.)


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)   (17)

in which s_1 and s_2 are the two sequences to be tested, |s_1| and |s_2| are their lengths, m is the number of matching characters in s_1 and s_2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

d_w = d_j + l \cdot p \cdot (1 - d_j)   (18)

with d_j being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or misspellings such as 'Martha' vs 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 used by Levenshtein). A compact implementation sketch follows this list:

D(i, j) = \min\begin{cases} D(i-1, j-1) + G \\ D(i-1, j) + G \\ D(i, j-1) + G \end{cases}   (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
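To make the Dice/Jaccard worked example above concrete, the following small program (an illustration, not part of JustAsk) computes the bigram-based Dice and Jaccard scores for the pair 'Kennedy' vs 'Keneddy'.

```java
import java.util.HashSet;
import java.util.Set;

// Computes bigram-based Dice and Jaccard similarity for two strings.
public class BigramSimilarity {

    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    static double dice(String a, String b) {
        Set<String> ga = bigrams(a);
        Set<String> gb = bigrams(b);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    static double jaccard(String a, String b) {
        Set<String> ga = bigrams(a);
        Set<String> gb = bigrams(b);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        Set<String> union = new HashSet<>(ga);
        union.addAll(gb);
        return (double) common.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));    // 0.8333... (10/12)
        System.out.println(jaccard("Kennedy", "Keneddy")); // 0.7142... (5/7)
    }
}
```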
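As a reference for the Needleman-Wunsch recurrence in equation (19), here is a compact dynamic-programming sketch. It assumes (as the condensed formula leaves implicit) that matching characters cost 0 on the diagonal, so with G = 1 it reduces to the classic Levenshtein distance.

```java
// Minimal Needleman-Wunsch-style distance with a uniform gap/substitution cost G.
public class NeedlemanWunsch {

    public static double distance(String a, String b, double g) {
        double[][] d = new double[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i * g;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j * g;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                double sub = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0.0 : g;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,      // substitution / match
                           Math.min(d[i - 1][j] + g,           // deletion (gap)
                                    d[i][j - 1] + g));         // insertion (gap)
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting", 1.0)); // 3.0 (Levenshtein)
        System.out.println(distance("kitten", "sitting", 0.5)); // 1.5
    }
}
```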


Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 2.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold   0.17                        0.33                        0.5
Levenshtein          P: 51.1  R: 87.0  A: 44.5    P: 43.7  R: 87.5  A: 38.5    P: 37.3  R: 84.5  A: 31.5
Overlap              P: 37.9  R: 87.0  A: 33.0    P: 33.0  R: 87.0  A: 33.0    P: 36.4  R: 86.5  A: 31.5
Jaccard index        P: 9.8   R: 56.0  A: 5.5     P: 9.9   R: 55.5  A: 5.5     P: 9.9   R: 55.5  A: 5.5
Dice Coefficient     P: 9.8   R: 56.0  A: 5.5     P: 9.9   R: 56.0  A: 5.5     P: 9.9   R: 55.5  A: 5.5
Jaro                 P: 11.0  R: 63.5  A: 7.0     P: 11.9  R: 63.0  A: 7.5     P: 9.8   R: 56.0  A: 5.5
Jaro-Winkler         P: 11.0  R: 63.5  A: 7.0     P: 11.9  R: 63.0  A: 7.5     P: 9.8   R: 56.0  A: 5.5
Needleman-Wunch      P: 43.7  R: 87.0  A: 38.0    P: 43.7  R: 87.0  A: 38.0    P: 16.9  R: 59.0  A: 10.0
LogSim               P: 43.7  R: 87.0  A: 38.0    P: 43.7  R: 87.0  A: 38.0    P: 42.0  R: 87.0  A: 36.5
Soundex              P: 35.9  R: 83.5  A: 30.0    P: 35.9  R: 83.5  A: 30.0    P: 28.7  R: 82.0  A: 23.5
Smith-Waterman       P: 17.3  R: 63.5  A: 11.0    P: 13.4  R: 59.5  A: 8.0     P: 10.4  R: 57.5  A: 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold in the original table (Levenshtein at threshold 0.17).

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it additionally gives extra weight to strings that, besides sharing at least one perfect match, have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.

The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))}   (21)

and

firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2}   (22)

with a_1 and a_2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
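The metric is simple enough to be reproduced directly from equations (20)-(22). The sketch below is our own reading of it (tokenization and pairing details are assumptions); with it, 'J. F. Kennedy' vs 'John Fitzgerald Kennedy' yields a distance of roughly 0.42, low enough to be clustered under the thresholds used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of the 'JFKDistance' described by equations (20)-(22).
public class JFKDistance {

    private static List<String> tokenize(String answer) {
        List<String> tokens = new ArrayList<>();
        for (String t : answer.toLowerCase(Locale.ROOT).split("\\s+")) {
            t = t.replaceAll("[^a-z0-9]", ""); // drop punctuation such as '.'
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    public static double distance(String answer1, String answer2) {
        List<String> t1 = tokenize(answer1);
        List<String> t2 = tokenize(answer2);

        // Common terms (perfect token matches), removed as they are paired.
        List<String> rest2 = new ArrayList<>(t2);
        List<String> rest1 = new ArrayList<>();
        int commonTerms = 0;
        for (String tok : t1) {
            if (rest2.remove(tok)) commonTerms++;
            else rest1.add(tok);
        }
        double commonTermsScore =
                commonTerms / (double) Math.max(t1.size(), t2.size());   // eq. (21)

        // First-letter matches among the remaining (non-common) tokens.
        int nonCommonTerms = rest1.size() + rest2.size();
        int firstLetterMatches = 0;
        for (String tok : rest1) {
            for (int i = 0; i < rest2.size(); i++) {
                if (rest2.get(i).charAt(0) == tok.charAt(0)) {
                    firstLetterMatches++;
                    rest2.remove(i);
                    break;
                }
            }
        }
        double firstLetterMatchesScore = nonCommonTerms == 0
                ? 0.0
                : (firstLetterMatches / (double) nonCommonTerms) / 2.0;  // eq. (22)

        return 1.0 - commonTermsScore - firstLetterMatchesScore;          // eq. (20)
    }

    public static void main(String[] args) {
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy")); // ~0.4167
    }
}
```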

A few flaws of this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures according to the type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                      200 questions                 1210 questions
Before                P: 51.1  R: 87.0  A: 44.5     P: 28.2  R: 79.8  A: 22.5
After new metric      P: 51.7  R: 87.0  A: 45.0     P: 29.2  R: 79.8  A: 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort we used the best thresholds for each metric, the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are looking for), Table 10 shows the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).

Metric / Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index        7.58     0.00       0.00
Dice                 7.58     0.00       0.00
Jaro                 9.09     0.00       0.00
Jaro-Winkler         9.09     0.00       0.00
Needleman-Wunch      42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman       9.09     2.56       0.00
JFKDistance          54.55    53.85      4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best results of the system in bold in the original table, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, i.e. results are always 'proportional' in each category to how they perform overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are easily failable, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve a) ignoring 'one' to 'three' when written in words, b) excluding the publishing date from the snippet somehow, and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we list the multiple answers separately as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1% (from 50.6% to 55.2%) using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805-810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25-32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16-23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297-302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18-26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511-525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241-272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491-498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414-420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19-33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 - short papers - Volume 2, NAACL-Short '03, pages 46-48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143-150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211-240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265-283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707-710, 1966. Doklady Akademii Nauk SSSR, V163 No4 845-848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1-7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112-120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471-486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371-391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775-780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39-41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311-318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18-26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448-453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa - Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513-523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa - Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just ask - A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computational Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491-502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1-13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation - LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354-359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133-138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160-166, Sydney, Australia, 2005.

Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the jth token of candidate answer Ai

• Ai,j,k – the kth char of the jth token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is a hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001



1.2 Questions of category Numeric

Given

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  – Example: less than 300 includes 200

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr. Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to F. Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr. Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)

• levenshteinDistance(Ai, Aj) < 1

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey
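To give a flavour of how such clauses translate into code, here is a hedged sketch of two of the Human equivalence rules above (3-token vs 2-token names, and matching middle initials); the class structure and method names are our own illustration, not the actual JustAsk implementation.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative encoding of two of the Human equivalence rules above.
public class HumanEquivalenceRules {

    private static List<String> tokens(String answer) {
        return Arrays.asList(answer.trim().split("\\s+"));
    }

    private static boolean equalsIc(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    // Rule: |Ai| = 3, |Aj| = 2, Ai,0 = Aj,0 and Ai,2 = Aj,1,
    // e.g. "John F. Kennedy" is equivalent to "John Kennedy".
    static boolean firstAndLastNameMatch(String ai, String aj) {
        List<String> t1 = tokens(ai);
        List<String> t2 = tokens(aj);
        return t1.size() == 3 && t2.size() == 2
                && equalsIc(t1.get(0), t2.get(0))
                && equalsIc(t1.get(2), t2.get(1));
    }

    // Rule: |Ai| = 3, |Aj| = 3, same first and last tokens, and the middle
    // tokens share their first character,
    // e.g. "John Fitzgerald Kennedy" is equivalent to "John F. Kennedy".
    static boolean middleInitialMatch(String ai, String aj) {
        List<String> t1 = tokens(ai);
        List<String> t2 = tokens(aj);
        return t1.size() == 3 && t2.size() == 3
                && equalsIc(t1.get(0), t2.get(0))
                && equalsIc(t1.get(2), t2.get(2))
                && Character.toLowerCase(t1.get(1).charAt(0))
                   == Character.toLowerCase(t2.get(1).charAt(0));
    }

    public static void main(String[] args) {
        System.out.println(firstAndLastNameMatch("John F. Kennedy", "John Kennedy"));          // true
        System.out.println(middleInitialMatch("John Fitzgerald Kennedy", "John F. Kennedy"));  // true
    }
}
```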

• Introduction
• Related Work
  • Relating Answers
  • Paraphrase Identification
    • SimFinder – A methodology based on linguistic analysis of comparable corpora
    • Methodologies purely based on lexical similarity
    • Canonicalization as complement to lexical features
    • Use of context to extract similar phrases
    • Use of dissimilarities and semantics to provide better similarity measures
  • Overview
• JustAsk architecture and work points
  • JustAsk
    • Question Interpretation
    • Passage Retrieval
    • Answer Extraction
  • Experimental Setup
    • Evaluation measures
    • Multisix corpus
    • TREC corpus
  • Baseline
    • Results with Multisix Corpus
    • Results with TREC Corpus
  • Error Analysis over the Answer Module
• Relating Answers
  • Improving the Answer Extraction module: Relations Module Implementation
  • Additional Features
    • Normalizers
    • Proximity Measure
  • Results
  • Results Analysis
  • Building an Evaluation Corpus
• Clustering Answers
  • Testing Different Similarity Metrics
    • Clusterer Metrics
    • Results
    • Results Analysis
  • New Distance Metric - JFKDistance
    • Description
    • Results
    • Results Analysis
  • Possible new features – Combination of distance metrics to enhance overall results
  • Posterior analysis
• Conclusions and Future Work

Acronyms

TREC – Text REtrieval Conference
QA – Question Answering
POS – Part of Speech


1 Introduction

QA systems have been with us since the second half of the 20th century. But it was with the World Wide Web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the effort – after all, most of the time we just want a specific, single 'correct' answer instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.

This led us to the development of JustAsk [39, 38], a QA system with the mission to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed of the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e. they consider the 'right' answer to be the one that appears more often in more documents. Since we are working with the web as our corpus, we run the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem we need ways to connect all of these variations.

In summary, the goals of this work are:

– To perform a state of the art on answers' relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– To improve the Answer Extraction module by finding/implementing a proper methodology/tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– To test new clustering measures;

– To create corpora and software tools to evaluate our work.

This document is structured in the following way: Section 2 describes related work on relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the clustering stage and the results achieved; and finally, in Section 6, we present the main conclusions and future work.


2 Related Work

As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences if we extend the concept enough), we then discuss the different techniques to detect paraphrases described in the literature.

2.1 Relating Answers

As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e. every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency counts and on the number and type of relationships between pairs present in the relevant passages.

From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').

Inclusion – if answers are consistent and differ in specificity, one entailing the other. These can be relations of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.

The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.

2.2 Paraphrase Identification

According to WordReference.com1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should not be counted as a paraphrase.

Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus2, from now on referred to as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages we present an overview of some approaches and techniques for paraphrase identification in natural language, especially after the release of the MSR Corpus.

2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora

We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed the clustering of highly related textual units. In 1999 they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:

1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.


– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches consider not only identical words but also words that match on their stem – the most basic form of a word, to which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.

The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of paraphrase identification, namely the introduction of a quantitative evaluation measure.

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap counts how many N-grams overlap in two sentences (usually with N ≤ 4) – in other words, how many sequences of words the two sentences have in common. The following formula gives the measure of the Word Simple N-gram Overlap:

NGsim(S_1, S_2) = \frac{1}{N} \sum_{n=1}^{N} \frac{Count\_match(n\text{-gram})}{Count(n\text{-gram})}   (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
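A direct reading of this measure can be implemented in a few lines; the sketch below treats sentences as whitespace-separated word sequences and uses N = 4, both of which are assumptions on our part.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Word Simple N-gram Overlap, as in equation (1) (illustrative sketch).
public class NGramOverlap {

    static Set<String> wordNGrams(String sentence, int n) {
        String[] words = sentence.toLowerCase().split("\\s+");
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }

    // Averages, over n = 1..N, the fraction of the shorter sentence's n-grams
    // that also occur in the other sentence.
    static double ngSim(String s1, String s2, int maxN) {
        double sum = 0.0;
        for (int n = 1; n <= maxN; n++) {
            Set<String> g1 = wordNGrams(s1, n);
            Set<String> g2 = wordNGrams(s2, n);
            Set<String> shorter = (g1.size() <= g2.size()) ? g1 : g2;
            Set<String> longer = (shorter == g1) ? g2 : g1;
            if (shorter.isEmpty()) continue; // one sentence has fewer than n words
            Set<String> common = new HashSet<>(shorter);
            common.retainAll(longer);
            sum += (double) common.size() / shorter.size();
        }
        return sum / maxN;
    }

    public static void main(String[] args) {
        // Unigrams: 3/4 in common; bigrams: 1/3; trigrams and 4-grams: 0.
        System.out.println(ngSim("the quick brown fox", "the quick red fox", 4));
    }
}
```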

A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edit operations needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses. A string edit distance function like Levenshtein achieved high precision but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer but also less parallel data.


In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e. four common words between S1 and S2. The function in use is:

L(x, \lambda) = -\log_2\left(\frac{\lambda}{x}\right)   (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are way too similar). The complete formula is written below:

logSim(S_1, S_2) = \begin{cases} L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\ e^{-k \cdot L(x, \lambda)} & \text{otherwise} \end{cases}

(3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e. the number of words of the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.

L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs in those two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.
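The basic (two-parameter) version of LogSim is easy to reproduce. The sketch below counts every shared word as a link (so function words such as 'a' and 'for' are also counted, unlike the four content-word links of the example above) and uses k = 2.7; the tokenization is our own assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the basic LogSim similarity (equations (2) and (3)).
public class LogSim {

    private static final double K = 2.7; // penalty constant used by Cordeiro et al.

    public static double similarity(String s1, String s2) {
        Set<String> w1 = new HashSet<>(Arrays.asList(s1.toLowerCase().split("\\s+")));
        Set<String> w2 = new HashSet<>(Arrays.asList(s2.toLowerCase().split("\\s+")));
        Set<String> links = new HashSet<>(w1);
        links.retainAll(w2);                       // lambda: exclusive lexical links
        int x = Math.max(w1.size(), w2.size());    // words of the longest sentence
        if (links.isEmpty()) return 0.0;
        double l = -(Math.log((double) links.size() / x) / Math.log(2)); // L(x, lambda)
        return l < 1.0 ? l : Math.exp(-K * l);
    }

    public static void main(String[] args) {
        System.out.println(similarity(
                "Mary had a very nice dinner for her birthday",
                "It was a successful dinner party for Mary on her birthday"));
    }
}
```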

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.


2.2.3 Canonicalization as complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves in particular number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e. producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which leaves out a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs only as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other, more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent point: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information while sharing little lexical resemblance, and for those the usual lexical similarity measures mentioned above are not enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S_1, S_2) = \cos(S_1, S_2) - mismatch\_penalty   (4)


with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic programming recurrence – a way of breaking down a complex problem into a series of simpler subproblems:

s(S_1, S_2) = \max \begin{cases}
s(S_1, S_2 - 1) - skip\_penalty \\
s(S_1 - 1, S_2) - skip\_penalty \\
s(S_1 - 1, S_2 - 1) + sim(S_1, S_2) \\
s(S_1 - 1, S_2 - 2) + sim(S_1, S_2) + sim(S_1, S_2 - 1) \\
s(S_1 - 2, S_2 - 1) + sim(S_1, S_2) + sim(S_1 - 1, S_2) \\
s(S_1 - 2, S_2 - 2) + sim(S_1, S_2 - 1) + sim(S_1 - 1, S_2)
\end{cases}

(5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.

Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e. the passages from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phased framework using dissimilarities between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier to determine whether we can still deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity of two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the lsquospecificityrsquo be-tween two words giving more weight to a matching between two specific words(like two tonalities of the same color for instance) than two generic conceptsThis specificity is determined using inverse document frequency (IDF) [37] given


by the total number of documents in the corpus divided by the number of documents that include the word. As corpus they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

\[ sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)} \right) \tag{6} \]

For each word w in the segment T1, the goal is to identify the word that matches it the most in T2, given by maxSim(w, T2), apply the specificity via IDF, sum these weighted values, and normalize the sum by the total IDF of the segment in question. The same process is applied to T2, and in the end we take the average between the two directions (a code sketch of this computation is given after the metric list below). They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22] and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics:

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e. the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

4 http://www.natcorp.ox.ac.uk

– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.

– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.
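To make Equation (6) more concrete, the following minimal sketch computes the segment similarity given any of the word-to-word measures above and an IDF table; the interfaces and names are illustrative assumptions, not part of the cited work.

import java.util.List;

// Minimal sketch of the segment similarity in Equation (6). 'WordSimilarity' stands
// for any of the eight word-to-word measures described above, and 'IdfTable' for a
// precomputed IDF table (built, for example, over the British National Corpus).
public class SegmentSimilarity {

    public interface WordSimilarity { double maxSim(String word, List<String> segment); }
    public interface IdfTable { double idf(String word); }

    public static double sim(List<String> t1, List<String> t2, WordSimilarity ws, IdfTable idf) {
        return 0.5 * (directedSim(t1, t2, ws, idf) + directedSim(t2, t1, ws, idf));
    }

    // One direction of the formula: sum of maxSim(w, other) * idf(w), normalized by the sum of idf(w).
    private static double directedSim(List<String> from, List<String> to, WordSimilarity ws, IdfTable idf) {
        double weighted = 0.0, norm = 0.0;
        for (String w : from) {
            weighted += ws.maxSim(w, to) * idf.idf(w);
            norm += idf.idf(w);
        }
        return norm == 0.0 ? 0.0 : weighted / norm;
    }
}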

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e. methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.

– Cosine measure – using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning. They therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors \vec{a} and \vec{b}, and the similarity is calculated using the following formula:

\[ sim(\vec{a}, \vec{b}) = \frac{\vec{a}\, W\, \vec{b}}{|\vec{a}|\, |\vec{b}|} \tag{7} \]


with W being a semantic similarity matrix containing information about the similarity level of each word pair – in other words, a number between 0 and 1 for each W_{ij}, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea, Qiu or Zhang, and five with also a better precision rating than those three approaches.
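A minimal sketch of Equation (7) is given below, assuming the binary vectors and the similarity matrix W are indexed over the word sets of the two sentences; the class and method names are illustrative.

// Minimal sketch of Equation (7): sim(a, b) = (a * W * b) / (|a| |b|), where W is a
// word-by-word semantic similarity matrix with values between 0 and 1, built with
// one of the WordNet metrics mentioned above.
public class MatrixSimilarity {

    public static double sim(double[] a, double[] b, double[][] w) {
        double numerator = 0.0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b.length; j++)
                numerator += a[i] * w[i][j] * b[j];
        return numerator / (norm(a) * norm(b));
    }

    private static double norm(double[] v) {
        double sum = 0.0;
        for (double x : v) sum += x * x;
        return Math.sqrt(sum);
    }
}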

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

\[ Paraphrase(S_1, S_2) = Sim(S_1, S_2) / Diss(S_1, S_2) \tag{8} \]

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e. it adds much more information), then there is no two-way entailment and therefore the two sentences are non-paraphrases; Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,


expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential to be successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e. pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two, with negative pairs added in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage the idea is to gather important information

5 http://www.inesc-id.pt


about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e. a word in the given question that represents the information we are looking for. Question Classification is also a subtask in this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e. extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, learning how to rewrite a question in order to get us closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalization, clustering and filtering of the candidate answers, to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to


a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will be transformed into 'D21 M3 Y1961'.

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as

\[ Overlap(X, Y) = 1 - \frac{|X \cap Y|}{\min(|X|, |Y|)} \tag{9} \]

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As mentioned earlier, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any member of the clusters. Initially every candidate answer has a cluster of its own and then, at each step, the two closest clusters are merged if their distance is under a certain threshold. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
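To illustrate the two steps just described, here is a minimal sketch of the overlap distance (Equation 9) and of a single-link clustering loop over candidate answers; the class and method names are illustrative and do not correspond to the actual JustAsk implementation.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the overlap distance and the single-link clustering step.
public class AnswerClustering {

    // Overlap(X, Y) = 1 - |X ∩ Y| / min(|X|, |Y|), computed over whitespace tokens.
    public static double overlapDistance(String x, String y) {
        Set<String> a = new HashSet<>(List.of(x.toLowerCase().split("\\s+")));
        Set<String> b = new HashSet<>(List.of(y.toLowerCase().split("\\s+")));
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return 1.0 - (double) shared.size() / Math.min(a.size(), b.size());
    }

    // Single-link agglomerative clustering: repeatedly merge the two closest clusters
    // while the minimum pairwise distance between their members is under the threshold.
    public static List<List<String>> cluster(List<String> answers, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String a : answers) clusters.add(new ArrayList<>(List.of(a)));
        while (clusters.size() > 1) {
            int bestI = -1, bestJ = -1;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = singleLink(clusters.get(i), clusters.get(j));
                    if (d < bestDist) { bestDist = d; bestI = i; bestJ = j; }
                }
            if (bestDist > threshold) break;
            clusters.get(bestI).addAll(clusters.remove(bestJ));
        }
        return clusters;
    }

    private static double singleLink(List<String> c1, List<String> c2) {
        double min = Double.MAX_VALUE;
        for (String a : c1) for (String b : c2) min = Math.min(min, overlapDistance(a, b));
        return min;
    }
}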

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

\[ Precision = \frac{\text{Correctly judged items}}{\text{Relevant (non-null) items retrieved}} \tag{10} \]

Recall:

\[ Recall = \frac{\text{Relevant (non-null) items retrieved}}{\text{Items in test corpus}} \tag{11} \]

F-measure (F1):

\[ F = \frac{2 \times precision \times recall}{precision + recall} \tag{12} \]

Accuracy:

\[ Accuracy = \frac{\text{Correctly judged items}}{\text{Items in test corpus}} \tag{13} \]

Mean Reciprocal Rank (MRR), used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer; the mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

\[ MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i} \tag{14} \]
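As a small illustration of Equation (14), the sketch below computes the MRR of a set of queries, given the rank of the first correct answer for each query (0 when no correct answer was returned).

import java.util.List;

// Minimal sketch of the Mean Reciprocal Rank in Equation (14).
public class MeanReciprocalRank {

    public static double mrr(List<Integer> ranksOfFirstCorrectAnswer) {
        double sum = 0.0;
        for (int rank : ranksOfFirstCorrectAnswer)
            if (rank > 0) sum += 1.0 / rank;   // queries with no correct answer add nothing
        return sum / ranksOfFirstCorrectAnswer.size();
    }

    public static void main(String[] args) {
        // first correct answers at ranks 1 and 3, plus one unanswered query: (1 + 1/3 + 0) / 3 ≈ 0.44
        System.out.println(mrr(List.of(1, 3, 0)));
    }
}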

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then seeing if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of the references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages that we found at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64×200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:

1. Question=What is the capital of Somalia?

URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing the corpus with a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each one of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...', if we insert 'Mogadishu' the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,


while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.


3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided a set of 1768 questions, from which we filtered out the questions we knew the New York Times had a correct answer to. 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources according to each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference. We simply do not know, for now, who 'he' or 'she' is. It makes us question: "How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?"

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, and in as many variations, as in Google.

We also questioned whether the big problem is within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not the whole documents but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to compute the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer was extracted, while the success of the AS stage is given by the number of questions for which one final correct answer was extracted from the list of candidate answers. The tables also present a distribution of the number of questions over the amount of extracted answers.

                    Google – 8 passages            Google – 32 passages
Extracted answers   Questions   CAE       AS       Questions   CAE       AS
                    (#)         success   success  (#)         success   success
0                   21          0         –        19          0         –
1 to 10             73          62        53       16          11        11
11 to 20            32          29        22       13          9         7
21 to 40            11          10        8        61          57        39
41 to 60            1           1         1        30          25        17
61 to 100           0           –         –        18          14        11
> 100               0           –         –        5           4         3
All                 138         102       84       162         120       88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.


                    Google – 64 passages
Extracted answers   Questions (#)   CAE success   AS success
0                   17              0             0
1 to 10             18              9             9
11 to 20            2               2             2
21 to 40            16              12            8
41 to 60            42              38            25
61 to 100           38              32            18
> 100               35              30            17
All                 168             123           79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

              Google – 8 passages   Google – 32 passages   Google – 64 passages
      Total   Q     CAE    AS       Q     CAE    AS        Q     CAE    AS
abb   4       3     1      1        3     1      1         4     1      1
des   9       6     4      4        6     4      4         6     4      4
ent   21      15    1      1        16    1      0         17    1      0
hum   64      46    42     34       55    48     36        56    48     31
loc   39      27    24     19       34    27     22        35    29     21
num   63      41    30     25       48    35     25        50    40     22
      200     138   102    84       162   120    88        168   123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of questions created using TREC questions and the New York Times corpus.

Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

            20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, it also takes more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still presents in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

– 'Namibia has built a water canal, measuring about 300 km long. ... Siyabona Africa Travel (Pty) Ltd "Popa Falls — Okavango River — Botswana" webpage ...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr Clinton', normalizing abbreviations such as 'Mr', 'Mrs', 'Queen', etc. Then, by applying a new metric that performs similarity on a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e. common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.


3. Places/dates/numbers that differ in specificity – some candidates of the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e. one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such


as Soundex can solve more specific cases, such as the one described above. The question remains: when and where do we apply it exactly?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to also cover acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we solved each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of a set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the metrics already in place to more efficient, semantically-based ones, using notions not only of synonymy but also hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to


analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have made this process automatic and have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] has taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e. the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes the other, and one for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
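The sketch below illustrates how the Scorer interface and its two components could look. The names (Scorer, Baseline Scorer, Rules Scorer) and the weights mirror the description above, but the types and signatures are illustrative assumptions, not the real JustAsk code.

import java.util.ArrayList;
import java.util.List;

// Minimal, illustrative sketch of the scoring step of the relations module.
class Answer {
    enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY, ALTERNATIVE, CONTRADICTION }
    static class Relation {
        RelationType type; Answer other;
        Relation(RelationType type, Answer other) { this.type = type; this.other = other; }
    }

    String text;
    int frequency = 1;                                // exact repetitions of this answer
    List<Relation> relations = new ArrayList<>();     // relations shared with other answers
    double score = 0.0;

    Answer(String text) { this.text = text; }
}

interface Scorer { void score(List<Answer> candidates); }

class BaselineScorer implements Scorer {
    // Frequency: every exact repetition of an answer adds 1.0 to its score.
    public void score(List<Answer> candidates) {
        for (Answer a : candidates) a.score += a.frequency;
    }
}

class RulesScorer implements Scorer {
    static final double EQUIVALENCE_WEIGHT = 1.0;     // weights used in the first experiment
    static final double INCLUSION_WEIGHT = 0.5;

    public void score(List<Answer> candidates) {
        for (Answer a : candidates)
            for (Answer.Relation r : a.relations) {
                if (r.type == Answer.RelationType.EQUIVALENCE)
                    a.score += EQUIVALENCE_WEIGHT;
                else if (r.type == Answer.RelationType.INCLUDES || r.type == Answer.RelationType.INCLUDED_BY)
                    a.score += INCLUSION_WEIGHT;
                // alternative and contradictory relations are not boosted in this sketch
            }
    }
}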


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
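As an example, here is a minimal sketch of such a conversion normalizer for the Numeric:Distance case, converting kilometres to miles; the class structure, parsing and output formatting are illustrative assumptions.

// Minimal sketch of a unit-conversion normalizer for Numeric:Distance answers.
public class DistanceNormalizer {
    private static final double MILES_PER_KM = 0.621371;

    // Normalizes a distance answer to miles, e.g. "300 km" -> "186.4 miles".
    public static String normalize(String answer) {
        String[] parts = answer.trim().split("\\s+");
        if (parts.length != 2) return answer;            // leave unparseable answers untouched
        try {
            double value = Double.parseDouble(parts[0]);
            String unit = parts[1].toLowerCase();
            if (unit.startsWith("km") || unit.startsWith("kilomet")) value *= MILES_PER_KM;
            else if (!unit.startsWith("mile")) return answer;
            return String.format("%.1f miles", value);
        } catch (NumberFormatException e) {
            return answer;
        }
    }
}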

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' – in this case, however, we can easily limit King removals to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, allowing room for further extensions.

4.2.2 Proximity Measure

As with problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
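A minimal sketch of this proximity boost follows; the 20-character threshold and the 5-point boost come from the description above, while the way positions are measured (first occurrence in the passage) is an assumption made for illustration.

// Minimal sketch of the proximity boost applied to a candidate answer's score.
public class ProximityScorer {
    private static final int DISTANCE_THRESHOLD = 20;   // characters
    private static final double BOOST = 5.0;            // points added to the score

    public static double boost(String passage, String answer, String[] keywords, double score) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0) return score;
        double total = 0.0;
        int found = 0;
        for (String k : keywords) {
            int pos = passage.indexOf(k);
            if (pos >= 0) { total += Math.abs(pos - answerPos); found++; }
        }
        if (found > 0 && total / found <= DISTANCE_THRESHOLD) return score + BOOST;
        return score;
    }
}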


4.3 Results

                  200 questions                1210 questions
Before features   P 50.6%  R 87.0%  A 44.0%    P 28.0%  R 79.5%  A 22.2%
After features    P 51.1%  R 87.0%  A 44.5%    P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references we gathered in the previous corpus and, for each pair of references, assigned one of the possible four types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in that case a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;


– 'D' for cases of doubt between relation types.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.

Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{15} \]

In other words, the idea is to divide the number of characters the two sequences share by the number of characters in their union.

– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

\[ Dice(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} \tag{16} \]

Dice's coefficient basically gives double the weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated over the sets of bigrams of each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2×5)/(6+6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71.
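A minimal sketch of the bigram-based Dice coefficient used in the example above follows; the main method reproduces the 'Kennedy' vs. 'Keneddy' computation.

import java.util.HashSet;
import java.util.Set;

// Minimal sketch of Dice's coefficient over character bigrams.
public class DiceCoefficient {

    public static double dice(String s1, String s2) {
        Set<String> b1 = bigrams(s1), b2 = bigrams(s2);
        Set<String> shared = new HashSet<>(b1);
        shared.retainAll(b2);
        return (2.0 * shared.size()) / (b1.size() + b2.size());
    }

    private static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++)
            result.add(s.substring(i, i + 2));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));   // 2*5 / (6+6) ≈ 0.83
    }
}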


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

\[ Jaro(s_1, s_2) = \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) \tag{17} \]

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

\[ d_w = d_j + l\,p\,(1 - d_j) \tag{18} \]

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 used by Levenshtein):

\[ D(i, j) = \min \begin{cases} D(i-1, j-1) + G \\ D(i-1, j) + G \\ D(i, j-1) + G \end{cases} \tag{19} \]

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – an algorithm focused on matching words which sound alike but differ lexically. With it, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.


Then, the LogSim metric by Cordeiro and his team, described in the Related Work chapter, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much it affects the performance of each metric.

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold   0.17                          0.33                          0.5
Levenshtein          P 51.1%  R 87.0%  A 44.5%     P 43.7%  R 87.5%  A 38.5%     P 37.3%  R 84.5%  A 31.5%
Overlap              P 37.9%  R 87.0%  A 33.0%     P 33.0%  R 87.0%  A 33.0%     P 36.4%  R 86.5%  A 31.5%
Jaccard index        P 9.8%   R 56.0%  A 5.5%      P 9.9%   R 55.5%  A 5.5%      P 9.9%   R 55.5%  A 5.5%
Dice Coefficient     P 9.8%   R 56.0%  A 5.5%      P 9.9%   R 56.0%  A 5.5%      P 9.9%   R 55.5%  A 5.5%
Jaro                 P 11.0%  R 63.5%  A 7.0%      P 11.9%  R 63.0%  A 7.5%      P 9.8%   R 56.0%  A 5.5%
Jaro-Winkler         P 11.0%  R 63.5%  A 7.0%      P 11.9%  R 63.0%  A 7.5%      P 9.8%   R 56.0%  A 5.5%
Needleman-Wunsch     P 43.7%  R 87.0%  A 38.0%     P 43.7%  R 87.0%  A 38.0%     P 16.9%  R 59.0%  A 10.0%
LogSim               P 43.7%  R 87.0%  A 38.0%     P 43.7%  R 87.0%  A 38.0%     P 42.0%  R 87.0%  A 36.5%
Soundex              P 35.9%  R 83.5%  A 30.0%     P 35.9%  R 83.5%  A 30.0%     P 28.7%  R 82.0%  A 23.5%
Smith-Waterman       P 17.3%  R 63.5%  A 11.0%     P 13.4%  R 59.5%  A 8.0%      P 10.4%  R 57.5%  A 6.0%

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in


order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim, also using a tokenizer to break the strings into sets of tokens, and then taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, while also adding an extra weight to strings that not only share at least one perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers. Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
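As a concrete illustration of formulas (20)-(22), the sketch below shows one possible implementation of this metric. It is not the actual JustAsk code: the class and method names are ours, and tokenization is reduced to lower-casing and whitespace splitting.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class JFKDistanceSketch {

    // Illustrative implementation of formulas (20)-(22):
    // distance = 1 - commonTermsScore - firstLetterMatchesScore
    public static double jfkDistance(String a1, String a2) {
        List<String> t1 = tokenize(a1);
        List<String> t2 = tokenize(a2);
        int maxLength = Math.max(t1.size(), t2.size());

        // Perfect matches between tokens (the 'common terms').
        int commonTerms = 0;
        for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
            if (t2.remove(it.next())) {
                it.remove();
                commonTerms++;
            }
        }
        double commonTermsScore = (double) commonTerms / maxLength;

        // First-letter matches among the tokens left unmatched by a perfect match.
        int nonCommonTerms = t1.size() + t2.size();
        int firstLetterMatches = 0;
        for (String tok : t1) {
            for (Iterator<String> jt = t2.iterator(); jt.hasNext(); ) {
                String other = jt.next();
                if (!tok.isEmpty() && !other.isEmpty() && tok.charAt(0) == other.charAt(0)) {
                    jt.remove();
                    firstLetterMatches++;
                    break;
                }
            }
        }
        double firstLetterMatchesScore = nonCommonTerms == 0
                ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;

        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    private static List<String> tokenize(String answer) {
        return new ArrayList<>(Arrays.asList(answer.trim().toLowerCase().split("\\s+")));
    }

    public static void main(String[] args) {
        // 1 - 1/3 - (2/4)/2 = 0.41(6) for the example discussed in the text.
        System.out.println(jfkDistance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}

Under these assumptions, the example pair gets a distance of roughly 0.42, below the highest clustering threshold tested (0.5), whereas Levenshtein leaves it around 0.5.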

A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we want to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows us the results after the implementation of this new distance metric.

                        200 questions           1210 questions
Before                  P=51.1 R=87.0 A=44.5    P=28.2 R=79.8 A=22.5
After the new metric    P=51.7 R=87.0 A=45.0    P=29.2 R=79.8 A=23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results, in %, after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)
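One possible (purely illustrative) way of realizing such a decision scheme is a dispatch table from question categories to distance functions; the class below is a sketch of ours, not JustAsk's actual API.

import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

public class CategoryMetricSelector {

    // Maps a question category (e.g. "Human", "Location") to a distance metric over two answers.
    private final Map<String, ToDoubleBiFunction<String, String>> byCategory = new HashMap<>();
    private final ToDoubleBiFunction<String, String> fallback;

    public CategoryMetricSelector(ToDoubleBiFunction<String, String> fallback) {
        this.fallback = fallback;
    }

    public void register(String category, ToDoubleBiFunction<String, String> metric) {
        byCategory.put(category, metric);
    }

    // Distance between two candidate answers, using the metric registered for the category.
    public double distance(String category, String a1, String a2) {
        return byCategory.getOrDefault(category, fallback).applyAsDouble(a1, a2);
    }
}

Registration would then mirror the rules above, e.g. selector.register("Human", Metrics::levenshtein) and selector.register("Location", Metrics::logSim), where Metrics is a placeholder for whatever implementations are available.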

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis we are looking for too much), here are the results for every metric incorporated, by the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric \ Category     Human    Location   Numeric
Levenshtein           50.00    53.85      4.84
Overlap               42.42    41.03      4.84
Jaccard index         7.58     0.00       0.00
Dice                  7.58     0.00       0.00
Jaro                  9.09     0.00       0.00
Jaro-Winkler          9.09     0.00       0.00
Needleman-Wunch       42.42    46.15      4.84
LogSim                42.42    46.15      4.84
Soundex               42.42    46.15      3.23
Smith-Waterman        9.09     2.56       0.00
JFKDistance           54.55    53.85      4.84

Table 10: Accuracy results, in %, for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways through each category, i.e., results in each category are always 'proportional' to how they perform overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output regarding the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement.

– Time-changing answers – Some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are easily failable, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– Filtering stage does not seem to work in certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as is stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error detected is in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as this, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help to improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805-810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25-32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16-23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297-302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18-26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511-525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241-272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491-498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414-420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19-33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 - short papers, Volume 2, NAACL-Short '03, pages 46-48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143-150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211-240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265-283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707-710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845-848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1-7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112-120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471-486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371-391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775-780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39-41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311-318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18-26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448-453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa - Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513-523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa - Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask - A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491-502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1-13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation - LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354-359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133-138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160-166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j - the j-th token of candidate answer Ai;

• Ai,j,k - the k-th char of the j-th token of candidate answer Ai;

• lemma(X) - the lemma of X;

• ishypernym(X, Y) - returns true if X is the hypernym of Y;

• length(X) - the number of tokens of X;

• contains(X, Y) - returns true if X contains Y;

• containsic(X, Y) - returns true if X contains Y, case insensitive;

• equals(X, Y) - returns true if X is lexicographically equal to Y;

• equalsic(X, Y) - returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) - returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: 'M01 Y2001' includes 'D01 M01 Y2001'

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: 'Y2001' includes 'D01 M01 Y2001'


1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX - the numeric value in X

• numXmax = numX * max

• numXmin = numX * min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: '200 people' is equivalent to '200'

• matchesic(Ai, "(.*)\bnumber( .*)") and matchesic(Aj, "(.*)\bnumber( .*)") and (numAimax > numAj) and (numAimin < numAj)
  – Example: '210 people' is equivalent to '200 people'

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)")
  – Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)") and (numAi = numAj)
  – Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: 'approximately 200 people' includes '210'


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number( .*)") and (numAi < numAj)
  – Example: 'more than 200' includes '300'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number( .*)") and (numAi > numAj)
  – Example: 'less than 300' includes '200'

A sketch of how these numeric rules can be applied in code is given below.
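The fragment below is a minimal sketch of how the tolerance-based equivalence rule above might be checked; the simplified number pattern and the method names are our own assumptions, not the actual JustAsk rule engine.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericRulesSketch {

    private static final Pattern NUMBER = Pattern.compile("(\\d+)"); // simplified stand-in for 'number'
    private static final double MAX = 1.1;                           // tolerance factors defined above
    private static final double MIN = 0.9;

    // Extracts the first numeric value found in a candidate answer, or null if there is none.
    private static Double numericValue(String answer) {
        Matcher m = NUMBER.matcher(answer);
        return m.find() ? Double.valueOf(m.group(1)) : null;
    }

    // Equivalence with tolerance: (numAi * max > numAj) and (numAi * min < numAj).
    public static boolean equivalent(String ai, String aj) {
        Double numAi = numericValue(ai);
        Double numAj = numericValue(aj);
        if (numAi == null || numAj == null) return false;
        return numAi * MAX > numAj && numAi * MIN < numAj;
    }

    public static void main(String[] args) {
        System.out.println(equivalent("210 people", "200 people")); // true under this sketch
        System.out.println(equivalent("400 people", "200 people")); // false
    }
}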

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: 'Mr. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: 'John Fitzgerald Kennedy' is equivalent to 'John F. Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F. Kennedy' is equivalent to 'F. Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: 'Mr. Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: 'Republic of China' is equivalent to 'China'

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: 'Florida' includes 'Miami, Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: 'California' includes 'Los Angeles, California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  – Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: 'animal' includes 'monkey'


1 Introduction

QA systems have been with us since the second half of the 20th century. But it was with the World Wide Web explosion that QA systems truly broke out, leaving restricted domains for global, worldwide ones.

With the exponential growth of information, finding specific information quickly became an issue. Search engines were, and still are, a valid option, but the amount of noisy information they generate may not be worth the shot – after all, most of the time we just want a specific, single 'correct' answer, instead of a set of documents deemed relevant by the engine we are using. The task of finding the most 'correct' answer for a given question should be the main purpose of a QA system, after all.

This led us to the development of JustAsk [39, 38], a QA system with the mission to retrieve (as any other QA system) answers from the web to a given question. JustAsk combines hand-crafted rules and machine-learning components, and it is composed by the same modules that define a general QA architecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not as efficient as we would like it to be.

QA systems usually base the process of finding an answer on the principle of redundancy [22], i.e., they consider the 'right' answer to be the one that appears more often and in more documents. Since we are working with the web as our corpus, we are dealing with the risk of several variations of the same answer being perceived as different answers, when in fact they are only one and should be counted as such. To solve this problem, we need ways to connect all of these variations.

In summary, the goals of this work are:

– To perform a state of the art review on answers' relations and paraphrase identification, taking into special account the approaches that can be more easily adopted by a QA system such as JustAsk;

– To improve the Answer Extraction module by finding/implementing a proper methodology/tools to correctly identify relations of equivalence and inclusion between answers, through a relations module of sorts;

– To test new clustering measures;

– To create corpora and software tools to evaluate our work.

This document is structured in the following way: Section 2 describes specific related work in the subject of relations between answers and paraphrase identification; Section 3 details our point of departure – the JustAsk system – and the tasks planned to improve its Answer Extraction module, along with the baseline results prior to this work; Section 4 describes the main work done in relating answers and the respective results; Section 5 details the work done for the Clustering stage and the results achieved; and finally, in Section 6, we present the main conclusions and future work.

2 Related Work

As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First, we identify the relations between answers that will be the target of our study. And since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then talk about the different techniques to detect paraphrases described in the literature.

2.1 Relating Answers

As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if it has a methodology of relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency count and the number and type of relationships between pairs present in the relevant passages.

From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – if answers are consistent and entail each other, whether notational variations ('John F. Kennedy' and 'John Kennedy', or '24 October 1993' and 'October 24th, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter').

Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.

The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.

2.2 Paraphrase Identification

According to WordReference.com 1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as expressing the same idea in two different text fragments, in a symmetric relationship in which one entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.

Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on mentioned as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages, we will present an overview of some approaches and techniques to paraphrase identification related to natural language, specially after the release of the MSR Corpus.

2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora

We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed the clustering of highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:

1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.

– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches not only consider identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering: analyzing if two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.

The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N <= 4); in other words, how many different sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:

NGsim(S1, S2) = (1/N) * sum_{n=1..N} [ Count_match(n-gram) / Count(n-gram) ]    (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.
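For illustration, here is one possible rendering of formula (1) in code; it is a sketch under our own assumptions (whitespace tokenization, distinct n-grams), not the authors' implementation.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NGramOverlap {

    // Word Simple N-gram Overlap of formula (1), usually with maxN = 4.
    public static double ngSim(String s1, String s2, int maxN) {
        String[] w1 = s1.toLowerCase().split("\\s+");
        String[] w2 = s2.toLowerCase().split("\\s+");
        double sum = 0.0;
        for (int n = 1; n <= maxN; n++) {
            Set<String> grams1 = ngrams(w1, n);
            Set<String> grams2 = ngrams(w2, n);
            // Count(n-gram): the number of n-grams in the shorter sentence.
            int shorter = Math.min(grams1.size(), grams2.size());
            if (shorter == 0) continue;
            grams1.retainAll(grams2);                 // Count_match(n-gram): overlapping n-grams
            sum += (double) grams1.size() / shorter;
        }
        return sum / maxN;
    }

    private static Set<String> ngrams(String[] words, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }
}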

A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of editions needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed the extraction of richer, but also less parallel, data.


In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore some other options. As an alternative to these metrics, a new metric called LogSim was proposed by Cordeiro's team. The idea is, for a pair of two input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

L(x, λ) = −log2(λ / x)    (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are too similar). The complete formula is written below:

logSim(S1, S2) = L(x, λ),           if L(x, λ) < 1.0
                 e^(−k * L(x, λ)),  otherwise    (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ <= y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.

L(x, y, λ) is computed using weighting factors α and β (β being 1 − α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases) to compensate for the lack of negative pairs offered by those two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better when the MSR Paraphrase Corpus was present in the corpus test.

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
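As an illustration of formulas (2) and (3), here is a sketch under our own simplifying assumptions (whitespace tokenization, and distinct shared words as links):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LogSimSketch {

    // LogSim score of formulas (2) and (3); k is the penalty constant (2.7 in Cordeiro et al.'s experiments).
    public static double logSim(String s1, String s2, double k) {
        String[] w1 = s1.toLowerCase().split("\\s+");
        String[] w2 = s2.toLowerCase().split("\\s+");
        int x = Math.max(w1.length, w2.length);          // length of the longest sentence

        Set<String> links = new HashSet<>(Arrays.asList(w1)); // lambda: words occurring in both sentences
        links.retainAll(new HashSet<>(Arrays.asList(w2)));
        double lambda = links.size();

        if (lambda == 0) return 0.0;                     // L would be infinite; the score tends to 0
        double l = -Math.log(lambda / x) / Math.log(2);  // formula (2): L(x, lambda) = -log2(lambda / x)
        return l < 1.0 ? l : Math.exp(-k * l);           // formula (3)
    }
}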


2.2.3 Canonicalization as complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other more 'sophisticated' methodologies, which again should support the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions are already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only the 'easy' paraphrases were identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S1, S2) = cos(S1, S2) − mismatch_penalty    (4)


with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming formula – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:

s(S1, S2) = max of:
  s(S1, S2−1) − skip_penalty
  s(S1−1, S2) − skip_penalty
  s(S1−1, S2−1) + sim(S1, S2)
  s(S1−1, S2−2) + sim(S1, S2) + sim(S1, S2−1)
  s(S1−2, S2−1) + sim(S1, S2) + sim(S1−1, S2)
  s(S1−2, S2−2) + sim(S1, S2−1) + sim(S1−1, S2)    (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
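A compact sketch of how recurrence (5) can be evaluated by dynamic programming, given a precomputed sentence-to-sentence similarity matrix (the zero-initialized boundary rows are our assumption):

public class LocalAlignmentSketch {

    // s[i][j]: weight of the optimal alignment of the first i sentences of one text
    // with the first j sentences of the other, following recurrence (5).
    // sim[i-1][j-1] holds sim(S1_i, S2_j) as defined in formula (4).
    public static double align(double[][] sim, double skipPenalty) {
        int n = sim.length;
        int m = sim[0].length;
        double[][] s = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double best = Math.max(s[i][j - 1] - skipPenalty, s[i - 1][j] - skipPenalty);
                best = Math.max(best, s[i - 1][j - 1] + sim[i - 1][j - 1]);
                if (j >= 2) best = Math.max(best, s[i - 1][j - 2] + sim[i - 1][j - 1] + sim[i - 1][j - 2]);
                if (i >= 2) best = Math.max(best, s[i - 2][j - 1] + sim[i - 1][j - 1] + sim[i - 2][j - 1]);
                if (i >= 2 && j >= 2) best = Math.max(best, s[i - 2][j - 2] + sim[i - 1][j - 2] + sim[i - 2][j - 1]);
                s[i][j] = best;
            }
        }
        return s[n][m];
    }
}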

Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods, such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phased framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, in order to determine whether we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus 4, a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = 1/2 * ( Σ_{w∈T1} [maxSim(w, T2) · idf(w)] / Σ_{w∈T1} idf(w)
                    + Σ_{w∈T2} [maxSim(w, T1) · idf(w)] / Σ_{w∈T2} idf(w) )    (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxSim(w, T2), and then apply the specificity via IDF, sum these weighted values, and normalize the result with the length of the text segment in question. The same process is applied to T2, and in the end we generate an average between T1 to T2 and vice versa. They offer us eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22], and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics; a short sketch of how formula (6) combines them is given after the list.

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between the two nodes in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.

– Lin – takes the Resnik measure and normalizes it, by using also the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.

4 http://www.natcorp.ox.ac.uk
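Returning to formula (6): the sketch below shows how the segment-to-segment similarity combines any of the word-to-word measures above (passed in as maxSim) with idf weighting. The functional signatures are our own assumptions, not the authors' code.

import java.util.function.ToDoubleBiFunction;
import java.util.function.ToDoubleFunction;

public class TextSemanticSimilaritySketch {

    // Formula (6): a symmetric, idf-weighted average of the best word-to-word similarities.
    public static double sim(String[] t1, String[] t2,
                             ToDoubleBiFunction<String, String[]> maxSim,
                             ToDoubleFunction<String> idf) {
        return 0.5 * (directional(t1, t2, maxSim, idf) + directional(t2, t1, maxSim, idf));
    }

    private static double directional(String[] from, String[] to,
                                      ToDoubleBiFunction<String, String[]> maxSim,
                                      ToDoubleFunction<String> idf) {
        double weighted = 0.0;
        double norm = 0.0;
        for (String w : from) {
            weighted += maxSim.applyAsDouble(w, to) * idf.applyAsDouble(w);
            norm += idf.applyAsDouble(w);
        }
        return norm == 0.0 ? 0.0 : weighted / norm;
    }
}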

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model in order to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also consider a verb and a noun ('assassinated' vs. 'assassination'), or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms, and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.

– Cosine measure – by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning, and therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, a and b, and the similarity is calculated using the following formula:

sim(a, b) = (a · W · b) / (|a| |b|)    (7)

with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's, and five also with a better precision rating than those three metrics.

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of a dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then they calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here represents what we found to have the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques, we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID 5 that combines rule-based components with machine learning-based ones. Over this chapter, we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

5 http://www.inesc-id.pt

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state of the art results in question classification [39].

312 Passage Retrieval

The passage retrieval module receives as input the interpreted question fromthe preceding module The output is a set of relevant passages ie extractsfrom a search corpus that contain key terms of the question and are thereforemost likely to contain a valid answer The module is composed by a group ofquery formulators and searchers The job here is to associate query formulatorsand question categories to search types Factoid-type questions such as lsquoWhenwas Mozart bornrsquo use Google or Yahoo search engines while definition questionslike lsquoWhat is a parrotrsquo use Wikipedia

As for query formulation, JustAsk employs multiple strategies according to the question category. The simplest is the keyword-based strategy, applied for factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get us closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalization, clustering and filtering of the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and Gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
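To illustrate this canonical form, the sketch below shows one possible way of normalizing date strings; the class, the method names and the accepted input patterns are simplified assumptions for illustration, not JustAsk's actual normalizer.

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;

public class DateNormalizer {
    // Input patterns we are willing to accept (an assumption, for illustration only).
    private static final String[] PATTERNS = { "MMMM d yyyy", "d MMMM yyyy", "yyyy-MM-dd" };

    /** Returns a canonical 'D<day> M<month> Y<year>' form, or null if the date cannot be parsed. */
    public static String normalize(String raw) {
        String cleaned = raw.replace(",", "").trim();
        for (String pattern : PATTERNS) {
            try {
                Date date = new SimpleDateFormat(pattern, Locale.ENGLISH).parse(cleaned);
                Calendar c = Calendar.getInstance();
                c.setTime(date);
                return "D" + c.get(Calendar.DAY_OF_MONTH)
                     + " M" + (c.get(Calendar.MONTH) + 1)
                     + " Y" + c.get(Calendar.YEAR);
            } catch (ParseException e) {
                // try the next pattern
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(normalize("March 21, 1961"));   // D21 M3 Y1961
        System.out.println(normalize("21 March 1961"));    // D21 M3 Y1961
    }
}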

Similar instances are then grouped into a set of clusters and, for that purpose, distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as

Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)   (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in the other. As mentioned in Section 3.1.3, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any member of one cluster and any member of the other. Initially every candidate answer has a cluster of its own and then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
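To make the clustering step concrete, here is a minimal sketch of single-link agglomerative clustering with a pluggable distance and a threshold; the names and structure are illustrative assumptions, not JustAsk's actual code.

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class SingleLinkClusterer {

    /** Groups candidate answers: the two closest clusters are merged while their single-link distance is below the threshold. */
    public static List<List<String>> cluster(List<String> answers,
                                             BiFunction<String, String, Double> distance,
                                             double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String a : answers) {                 // start with one cluster per candidate answer
            List<String> c = new ArrayList<>();
            c.add(a);
            clusters.add(c);
        }
        boolean merged = true;
        while (merged) {
            merged = false;
            double best = Double.MAX_VALUE;
            int bi = -1, bj = -1;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = singleLink(clusters.get(i), clusters.get(j), distance);
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            }
            if (bi >= 0 && best < threshold) {      // merge the two closest clusters
                clusters.get(bi).addAll(clusters.remove(bj));
                merged = true;
            }
        }
        return clusters;
    }

    /** Single-link distance: the minimum distance between any member of one cluster and any member of the other. */
    private static double singleLink(List<String> c1, List<String> c2,
                                     BiFunction<String, String, Double> distance) {
        double min = Double.MAX_VALUE;
        for (String a : c1)
            for (String b : c2)
                min = Math.min(min, distance.apply(a, b));
        return min;
    }
}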

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved   (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus   (11)

F-measure (F1):

F = 2 × (precision × recall) / (precision + recall)   (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus   (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by

MRR = (1/N) × Σ_{i=1}^{N} (1 / rank_i)   (14)
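As a direct reading of formula (14) in code, for illustration only, the method below assumes it receives, for each test question, the rank of the first correct answer (0 when no correct answer was returned).

public class Evaluation {

    /**
     * Mean Reciprocal Rank: ranks[i] holds the rank (1-based) of the first correct
     * answer for question i, or 0 if the system returned no correct answer.
     */
    public static double meanReciprocalRank(int[] ranks) {
        double sum = 0.0;
        for (int rank : ranks) {
            if (rank > 0) {
                sum += 1.0 / rank;   // reciprocal rank of this question
            }
        }
        return sum / ranks.length;   // average over the N test questions
    }

    public static void main(String[] args) {
        // Three questions: correct answer at rank 1, rank 3, and never returned.
        System.out.println(meanReciprocalRank(new int[] { 1, 3, 0 }));  // (1 + 1/3 + 0) / 3 ≈ 0.444
    }
}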

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then seeing if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages that we found at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64×200 entries) and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one finding and writing down correct answers, instead of the completely manual approach we followed before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially being used to construct new corpora in the future.
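A minimal sketch of this interactive loop is shown below; the data structures, prompts and class name are assumptions for illustration, not the actual tool.

import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class ReferenceCollector {

    /** For each question, shows its snippets and records the references typed by the user. */
    public static Map<String, List<String>> collect(List<String> questions,
                                                    Map<String, List<String>> snippetsByQuestion) {
        Map<String, List<String>> references = new LinkedHashMap<>();
        Scanner in = new Scanner(System.in);
        for (String question : questions) {
            List<String> refs = new ArrayList<>();
            for (String content : snippetsByQuestion.getOrDefault(question, Collections.emptyList())) {
                System.out.println("QUESTION: " + question);
                System.out.println("CONTENT: " + content);
                System.out.print("Reference (or 'skip'): ");
                String answer = in.nextLine().trim();
                if (answer.equals("skip") || answer.isEmpty()) {
                    continue;
                }
                if (content.contains(answer)) {   // only text present in the snippet is accepted
                    refs.add(answer);
                } else {
                    System.out.println("Not found in the snippet, ignored.");
                }
            }
            references.put(question, refs);
        }
        return references;
    }
}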

Figure 3: Execution of the application to create answers corpora in the command line.


3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions, from which we filtered out and kept the questions we knew the New York Times had a correct answer for; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information, by using pronouns for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: 'How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?'

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested, using a simple filter such as the one sketched below.
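This is a minimal sketch of such a pronoun filter, under the assumption that a question is discarded whenever it contains one of the listed pronouns as a whole word; the class name is hypothetical.

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class PronounFilter {
    // Third-party references that signal a dependency on a previous question.
    private static final Pattern PRONOUNS =
            Pattern.compile("\\b(he|she|it|her|him|hers|his)\\b", Pattern.CASE_INSENSITIVE);

    /** Keeps only the questions that do not contain any of the pronouns above. */
    public static List<String> filter(List<String> questions) {
        return questions.stream()
                .filter(q -> !PRONOUNS.matcher(q).find())
                .collect(Collectors.toList());
    }
}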

But there were still much less obvious cases present. Here is a blatant example of how we still ran into the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced it to 1210 questions.

New York Times corpus

The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, and in as many variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions: 200    Correct: 88    Wrong: 87    No Answer: 25

Accuracy: 44.0%    Precision: 50.3%    Recall: 87.5%    MRR: 0.37    Precision@1: 35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to compute the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer is extracted, while the success of the AS stage is given by the number of questions for which one correct final answer is selected from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.

                     Google – 8 passages        Google – 32 passages
Extracted answers   Questions  CAE   AS        Questions  CAE   AS
0                       21       0    –            19       0    –
1 to 10                 73      62   53            16      11   11
11 to 20                32      29   22            13       9    7
21 to 40                11      10    8            61      57   39
41 to 60                 1       1    1            30      25   17
61 to 100                0       –    –            18      14   11
> 100                    0       –    –             5       4    3
All                    138     102   84           162     120   88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

                     Google – 64 passages
Extracted answers   Questions  CAE   AS
0                       17       0    0
1 to 10                 18       9    9
11 to 20                 2       2    2
21 to 40                16      12    8
41 to 60                42      38   25
61 to 100               38      32   18
> 100                   35      30   17
All                    168     123   79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.

            Google – 8 passages   Google – 32 passages   Google – 64 passages
      Total   Q   CAE   AS          Q   CAE   AS           Q   CAE   AS
abb       4   3     1    1          3     1    1           4     1    1
des       9   6     4    4          6     4    4           6     4    4
ent      21  15     1    1         16     1    0          17     1    0
hum      64  46    42   34         55    48   36          56    48   31
loc      39  27    24   19         34    27   22          35    29   21
num      63  41    30   25         48    35   25          50    40   22
All     200 138   102   84        162   120   88         168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

             1134 questions   232 questions   1210 questions
Precision        26.7%            28.2%            27.4%
Recall           77.2%            67.2%            76.9%
Accuracy         20.6%            19.0%            21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.

Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages      20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results, after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, it also takes more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals the reason:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'


– 'Namibia has built a water canal measuring about 300 km long ... Siyabona Africa Travel (Pty) Ltd "Popa Falls - Okavango River - Botswana" webpage ...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search than a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59).

With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being the more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: 'Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?'

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities are referred to by completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: Gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to include acronyms such as these.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have made this process automatic and have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers and, from then on, use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

Each answer now possesses the set of relations it shares with other answers – namely the type of relation and the answer it is related to – and a score for each relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
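This is a simplified sketch of how such relation-based boosting could be wired, with the weights of the first experiment; the types and method names are illustrative assumptions, not the actual JustAsk classes.

import java.util.List;

public class RulesScorer {

    public enum RelationType { FREQUENCY, EQUIVALENCE, INCLUDES, INCLUDED_BY }

    /** A relation a candidate answer shares with another candidate answer. */
    public static class Relation {
        final RelationType type;
        final String otherAnswer;
        Relation(RelationType type, String otherAnswer) {
            this.type = type;
            this.otherAnswer = otherAnswer;
        }
    }

    // Weights used in the first experiment: 1.0 for frequency/equivalence, 0.5 for inclusions.
    private static final double FREQUENCY_WEIGHT = 1.0;
    private static final double EQUIVALENCE_WEIGHT = 1.0;
    private static final double INCLUSION_WEIGHT = 0.5;

    /** Boosts a candidate answer's base score with the relations it shares with other candidates. */
    public static double score(double baseScore, List<Relation> relations) {
        double boost = 0.0;
        for (Relation r : relations) {
            switch (r.type) {
                case FREQUENCY:   boost += FREQUENCY_WEIGHT;   break;
                case EQUIVALENCE: boost += EQUIVALENCE_WEIGHT; break;
                case INCLUDES:
                case INCLUDED_BY: boost += INCLUSION_WEIGHT;   break;
            }
        }
        return baseScore + boost;
    }
}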


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: the Baseline Scorer, to count the frequency, and the Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, which is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, leaving room for further extensions.
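Below is a condensed sketch of what these normalizers amount to; the conversion factors are standard, but the class layout and the (tiny) acronym map are illustrative assumptions, not the actual implementation.

import java.util.HashMap;
import java.util.Map;

public class Normalizers {

    /** Unit conversions for Numeric answers (illustrative subset). */
    public static double kmToMiles(double km)          { return km * 0.621371; }
    public static double metersToFeet(double m)        { return m * 3.28084; }
    public static double kgToPounds(double kg)         { return kg * 2.20462; }
    public static double celsiusToFahrenheit(double c) { return c * 9.0 / 5.0 + 32.0; }

    /** Case normalization for formatting issues such as 'Grozny' vs. 'GROZNY'. */
    public static String normalizeCase(String answer) {
        return answer.toLowerCase();
    }

    /** Abbreviation removal ('Mr.', 'Mrs.', 'Queen', ...), only at the start of the answer. */
    public static String stripTitles(String answer) {
        return answer.replaceFirst("^(Mr\\.?|Mrs\\.?|Ms\\.?|King|Queen|Dame|Sir|Dr\\.?)\\s+", "");
    }

    /** Acronym expansion for categories such as Location:Country and Human:Group. */
    private static final Map<String, String> ACRONYMS = new HashMap<>();
    static {
        ACRONYMS.put("US", "United States");
        ACRONYMS.put("UN", "United Nations");
    }

    public static String expandAcronym(String answer) {
        return ACRONYMS.getOrDefault(answer.replace(".", ""), answer);
    }
}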

4.2.2 Proximity Measure

As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that lie closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
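A minimal sketch of this proximity boost over a single passage follows; the character-offset bookkeeping and the constants mirror the description above, while the class and method names are assumptions.

public class ProximityScorer {

    private static final double DISTANCE_THRESHOLD = 20.0;  // average distance in characters
    private static final double BOOST = 5.0;                // score points added

    /** Returns the boosted score if the answer sits close (on average) to the query keywords in the passage. */
    public static double boost(String passage, String answer, String[] keywords, double score) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0) {
            return score;
        }
        double total = 0.0;
        int found = 0;
        for (String keyword : keywords) {
            int keywordPos = passage.indexOf(keyword);
            if (keywordPos >= 0) {
                total += Math.abs(keywordPos - answerPos);
                found++;
            }
        }
        if (found > 0 && total / found < DISTANCE_THRESHOLD) {
            return score + BOOST;
        }
        return score;
    }
}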


4.3 Results

                          200 questions                     1210 questions
Before features     P: 50.6%  R: 87.0%  A: 44.0%    P: 28.0%  R: 79.5%  A: 22.2%
After features      P: 51.1%  R: 87.0%  A: 44.5%    P: 28.2%  R: 79.8%  A: 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described over 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references we gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As with the 'answers corpus', in order to create this corpus containing the relationships between answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;


– 'D' for cases of doubt between relation types.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgement. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.

Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements the two sequences share by the total number of distinct elements that occur in either of them.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's Coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2×5)/(6+6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/(6+6−5) = 5/7 ≈ 0.71.

44

– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be compared, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l × p × (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 used by Levenshtein):

D(i, j) = min { D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }   (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better for answers with common affixes, such as 'acting' and 'enact'.


Then, the LogSim metric by Cordeiro and his team, described in the Related Work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much the threshold affects the performance of each metric.
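For reference, the sketch below shows roughly how metrics from that library can be plugged into such a threshold test. It assumes the SimMetrics classes extend AbstractStringMetric and expose a getSimilarity(String, String) method returning a similarity in [0, 1], which is converted here into a distance; class and method names should be checked against the library version actually used.

import uk.ac.shef.wit.simmetrics.similaritymetrics.AbstractStringMetric;
import uk.ac.shef.wit.simmetrics.similaritymetrics.JaroWinkler;
import uk.ac.shef.wit.simmetrics.similaritymetrics.Levenshtein;

public class MetricThresholdTest {

    /** Two candidate answers are clusterable when the distance (1 - similarity) is below the threshold. */
    public static boolean clusterable(AbstractStringMetric metric, String a, String b, double threshold) {
        double similarity = metric.getSimilarity(a, b);   // assumed to return a value in [0, 1]
        return (1.0 - similarity) < threshold;
    }

    public static void main(String[] args) {
        System.out.println(clusterable(new Levenshtein(), "Mount Kilimanjaro", "Mt Kilimanjaro", 0.33));
        System.out.println(clusterable(new JaroWinkler(), "Martha", "Marhta", 0.17));
    }
}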

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric \ Threshold          0.17                        0.33                        0.5
Levenshtein          P: 51.1  R: 87.0  A: 44.5   P: 43.7  R: 87.5  A: 38.5   P: 37.3  R: 84.5  A: 31.5
Overlap              P: 37.9  R: 87.0  A: 33.0   P: 33.0  R: 87.0  A: 33.0   P: 36.4  R: 86.5  A: 31.5
Jaccard index        P:  9.8  R: 56.0  A:  5.5   P:  9.9  R: 55.5  A:  5.5   P:  9.9  R: 55.5  A:  5.5
Dice Coefficient     P:  9.8  R: 56.0  A:  5.5   P:  9.9  R: 56.0  A:  5.5   P:  9.9  R: 55.5  A:  5.5
Jaro                 P: 11.0  R: 63.5  A:  7.0   P: 11.9  R: 63.0  A:  7.5   P:  9.8  R: 56.0  A:  5.5
Jaro-Winkler         P: 11.0  R: 63.5  A:  7.0   P: 11.9  R: 63.0  A:  7.5   P:  9.8  R: 56.0  A:  5.5
Needleman-Wunch      P: 43.7  R: 87.0  A: 38.0   P: 43.7  R: 87.0  A: 38.0   P: 16.9  R: 59.0  A: 10.0
LogSim               P: 43.7  R: 87.0  A: 38.0   P: 43.7  R: 87.0  A: 38.0   P: 42.0  R: 87.0  A: 36.5
Soundex              P: 35.9  R: 83.5  A: 30.0   P: 35.9  R: 83.5  A: 30.0   P: 28.7  R: 82.0  A: 23.5
Smith-Waterman       P: 17.3  R: 63.5  A: 11.0   P: 13.4  R: 59.5  A:  8.0   P: 10.4  R: 57.5  A:  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (%) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J.F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, while also adding an extra weight for strings that not only share at least one perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, for that extra weight we divide by 2 what would otherwise be a value identical to a match between two tokens.

The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
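A compact sketch of this distance, following formulas (20) to (22), is given below; the tokenization and tie-breaking details are assumptions, since the exact implementation is not reproduced here.

public class JFKDistance {

    /** Distance in [0, 1]: 0 means the answers are considered the same, 1 means nothing in common. */
    public static double distance(String a1, String a2) {
        String[] t1 = a1.trim().toLowerCase().split("\\s+");
        String[] t2 = a2.trim().toLowerCase().split("\\s+");
        boolean[] matched1 = new boolean[t1.length];
        boolean[] matched2 = new boolean[t2.length];

        // Perfect token matches (common terms).
        int commonTerms = 0;
        for (int i = 0; i < t1.length; i++) {
            for (int j = 0; j < t2.length; j++) {
                if (!matched2[j] && t1[i].equals(t2[j])) {
                    matched1[i] = true;
                    matched2[j] = true;
                    commonTerms++;
                    break;
                }
            }
        }

        // First-letter matches among the remaining (non-common) tokens, e.g. 'J.' vs 'John'.
        int firstLetterMatches = 0;
        for (int i = 0; i < t1.length; i++) {
            if (matched1[i]) continue;
            for (int j = 0; j < t2.length; j++) {
                if (!matched2[j] && t1[i].charAt(0) == t2[j].charAt(0)) {
                    matched2[j] = true;
                    firstLetterMatches++;
                    break;
                }
            }
        }

        int nonCommonTerms = (t1.length - commonTerms) + (t2.length - commonTerms);
        double commonTermsScore = (double) commonTerms / Math.max(t1.length, t2.length);
        double firstLetterMatchesScore =
                nonCommonTerms == 0 ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    public static void main(String[] args) {
        // 1 common term out of max(3, 3) tokens, 2 first-letter matches out of 4 unmatched tokens:
        // 1 - 1/3 - (2/4)/2 ≈ 0.42
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}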

A few flaws in this measure were identified later on, namely the scope of this measure within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows us the results after the implementation of this new distance metric.

                          200 questions                     1210 questions
Before              P: 51.1%  R: 87.0%  A: 44.5%    P: 28.2%  R: 79.8%  A: 22.5%
After new metric    P: 51.7%  R: 87.0%  A: 45.0%    P: 29.2%  R: 79.8%  A: 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers us the best results, we would have something like:

if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
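In Java, such a selection step could look like the hypothetical dispatcher below; the category names and metric choices are purely illustrative of the idea, not a decision tree actually adopted by the system.

public class MetricSelector {

    public interface StringDistance {
        double distance(String a, String b);
    }

    /** Chooses the distance metric to use at clustering time, based on the question category. */
    public static StringDistance forCategory(String category,
                                             StringDistance levenshtein,
                                             StringDistance logSim,
                                             StringDistance fallback) {
        switch (category) {
            case "Human":    return levenshtein;
            case "Location": return logSim;
            default:         return fallback;
        }
    }
}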

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect much the analysis we are looking for), the results for every metric incorporated, for the three main categories Human, Location and Numeric, which can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), are shown in Table 10.

Metric \ Category     Human    Location   Numeric
Levenshtein           50.00     53.85      4.84
Overlap               42.42     41.03      4.84
Jaccard index          7.58      0.00      0.00
Dice                   7.58      0.00      0.00
Jaro                   9.09      0.00      0.00
Jaro-Winkler           9.09      0.00      0.00
Needleman-Wunch       42.42     46.15      4.84
LogSim                42.42     46.15      4.84
Soundex               42.42     46.15      3.23
Smith-Waterman         9.09      2.56      0.00
JFKDistance           54.55     53.85      4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except for when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways through each category, i.e., results in each category are always 'proportional' to how they perform overall. That stops us from picking the best metric for each category, since we find out that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output regarding the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected Google corpus (in this case, 2010).

– Numeric answers easily fail due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in formats different from the expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.

After filling in missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state-of-the-art survey on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we should try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help to somewhat improve these results.


References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006

[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989

53

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.


[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.


[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is the hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: 'M01 Y2001' includes 'D01 M01 Y2001'

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: 'Y2001' includes 'D01 M01 Y2001'



1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]*))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX * max

• numXmin = numX * min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: '200 people' is equivalent to '200'

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
  – Example: '210 people' is equivalent to '200 people'

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
  – Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  – Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: 'approximately 200 people' includes '210'


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  – Example: 'more than 200' includes '300'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  – Example: 'less than 300' includes '200'

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mr)|(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: 'Mr Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: 'John Fitzgerald Kennedy' is equivalent to 'John F Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F Kennedy' is equivalent to 'F Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: 'Mr Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: 'Republic of China' is equivalent to 'China'


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: 'Florida' includes 'Miami, Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: 'California' includes 'Los Angeles, California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  – Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: 'animal' includes 'monkey'
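A minimal sketch of how a few of the rules above could be implemented, assuming plain whitespace tokenization and leaving out the WordNet-dependent predicates; the function names and regular expressions below are illustrative choices, not the exact ones used in JustAsk:

import re

def tokens(answer):
    # Tokenize a candidate answer by whitespace.
    return answer.split()

def equalsic(x, y):
    # Case-insensitive lexicographic equality.
    return x.lower() == y.lower()

NUMBER = r"\d+(?:[0-9.,E]*)"
APPROX = r"(?:approx[a-z]*|around|roughly|about|some)"

def relation_numeric(ai, aj, max_f=1.1, min_f=0.9):
    # Sketch of two Numeric rules: equivalence within a tolerance band,
    # and inclusion of a bare value by an 'approximately X' answer.
    tj = tokens(aj)
    mi, mj = re.search(NUMBER, ai), re.search(NUMBER, aj)
    if not (mi and mj):
        return None
    ni = float(mi.group().replace(",", ""))
    nj = float(mj.group().replace(",", ""))
    if ni * max_f > nj and ni * min_f < nj:
        return "equivalence"          # e.g. '210 people' ~ '200 people'
    if len(tj) == 1 and re.search(APPROX + r"\s+" + NUMBER, ai, re.I):
        return "inclusion"            # e.g. 'approximately 200 people' includes '210'
    return None

def relation_human(ai, aj):
    # Sketch of one Human rule: 'John F Kennedy' is equivalent to 'John Kennedy'.
    ti, tj = tokens(ai), tokens(aj)
    if len(ti) == 3 and len(tj) == 2 and equalsic(ti[0], tj[0]) and equalsic(ti[2], tj[1]):
        return "equivalence"
    return None

print(relation_numeric("210 people", "200 people"))      # equivalence
print(relation_human("John F Kennedy", "John Kennedy"))  # equivalence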


stage and the results achieved and, finally, in Section 6, we present the main conclusions and future work.


2 Related Work

As previously stated, the focus of this work is to find a better formula to catch relations of equivalence, from notational variances to paraphrases. First we identify the relations between answers that will be the target of our study. And, since paraphrases are such a particular type of equivalence (and yet with the potential to encompass all types of equivalences, if we extend the concept enough), we then talk about the different techniques to detect paraphrases described in the literature.

2.1 Relating Answers

As we said in Section 1, QA systems usually base the process of recovering answers on the principle of redundancy, i.e., every information item is likely to have been stated multiple times, in multiple ways, in multiple documents [22]. But this strategy only works if there is a methodology for relating answers behind it, and that is still a major task in QA.

Our main goal here is to learn how a pair of answers is related, so that our system can retrieve a more correct final answer, based on frequency count and on the number and type of relationships between pairs present in the relevant passages.

From the terminology first proposed by Webber and his team [44], and later described by Moriceau [29], we can define four different types of relationships between answers:

Equivalence – if answers are consistent and mutually entailing, whether notational variations ('John F Kennedy' and 'John Kennedy', or '23 October 1993' and 'October 23rd, 1993') or paraphrases (from synonyms and rephrasings, such as 'He was shot and died' vs. 'He was murdered', to inferences, such as 'Judy Garland was Liza Minelli's mother' and 'Liza Minelli was Judy Garland's daughter');

Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers);

Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether for the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher');

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.

The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.

2.2 Paraphrase Identification

According to WordReference.com1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are multiple other definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as two different text fragments expressing the same idea, in a symmetric relationship in which each entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.

Here is an example of two sentences which should be deemed paraphrases, taken from the Microsoft Research Paraphrase Corpus2, from now on mentioned as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages we will present an overview of some of the approaches and techniques for paraphrase identification in natural language, specially after the release of the MSR Corpus.

2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora

We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, from a team led by Vasileios Hatzivassiloglou, that allowed highly related textual units to be clustered. In 1999 they defined a more restricted concept of similarity, in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:

1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.


– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms and matching verbs with the same semantic class. These matches not only consider identical words, but also words that match on their stem – the most basic form of a word, from which affixes are attached;

– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.

The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N ≤ 4); in other words, how many different sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:

NGsim(S_1, S_2) = \frac{1}{N} \sum_{n=1}^{N} \frac{Count_{match}(N\text{-}gram)}{Count(N\text{-}gram)} \quad (1)

in which Count_match(N-gram) counts how many N-grams overlap for a given sentence pair, and Count(N-gram) represents the maximum number of N-grams in the shorter sentence.
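A small sketch of the word simple N-gram overlap measure of Equation (1), assuming whitespace tokenization; this is only an illustration, not the original implementation of Barzilay and Lee:

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(s1, s2, max_n=4):
    # Equation (1): for each n, the fraction of n-grams of the shorter
    # sentence that also occur in the longer one, averaged over n = 1..max_n.
    t1, t2 = s1.lower().split(), s2.lower().split()
    shorter, longer = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    total = 0.0
    for n in range(1, max_n + 1):
        short_ngrams = ngrams(shorter, n)
        if not short_ngrams:
            continue
        long_ngrams = set(ngrams(longer, n))
        matches = sum(1 for g in short_ngrams if g in long_ngrams)
        total += matches / len(short_ngrams)
    return total / max_n

print(ngram_overlap("Mary had a nice dinner", "Mary had a very nice dinner"))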

A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edit operations needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed to extract richer, but also less parallel, data.
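For reference, a compact dynamic-programming sketch of the Levenshtein distance mentioned above, with unit costs for insertions, deletions and substitutions; this is an illustrative implementation, not necessarily the one used in JustAsk:

def levenshtein(a, b):
    # previous[j] holds the edit distance between the current prefix of 'a'
    # and the first j characters of 'b'.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[len(b)]

print(levenshtein("kitten", "sitting"))  # 3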


In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the actual answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore some other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

L(x, \lambda) = -\log_2(\lambda / x) \quad (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are way too similar). The complete formula is written below:

logSim(S_1, S_2) = \begin{cases} L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\ e^{-k \cdot L(x, \lambda)} & \text{otherwise} \end{cases} \quad (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.

L(x, y, λ) is computed using weighting factors α and β (β being 1 − α) applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus3 and extra negative pairs (of non-paraphrases), to compensate for the lack of negative pairs offered by those two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better when the MSR Paraphrase Corpus was present in the test corpus.
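A minimal sketch of the base LogSim measure of Equations (2) and (3), under the simplifying assumption that the 'exclusive lexical links' are the case-insensitive word types shared by the two sentences, and with k = 2.7; the extended three-parameter variant is not shown:

import math

def logsim(s1, s2, k=2.7):
    # Count shared words (a crude stand-in for exclusive lexical links).
    w1, w2 = s1.lower().split(), s2.lower().split()
    links = len(set(w1) & set(w2))
    x = max(len(w1), len(w2))          # size of the longest sentence
    if links == 0:
        return 0.0
    l = -math.log2(links / x)          # Equation (2)
    # Equation (3): exponentially penalize pairs with great dissimilarities.
    return l if l < 1.0 else math.exp(-k * l)

s1 = "Mary had a very nice dinner for her birthday"
s2 = "It was a successful dinner party for Mary on her birthday"
print(logsim(s1, s2))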

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.


2.2.3 Canonicalization as complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves particularly number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as using a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other more 'sophisticated' methodologies, which again should support the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving Numeric entities that can be considered answers for certain types of questions are already covered. Passive to Active Voice and Future Tense transformations can, however, matter to more particular answers. Considering for instance the question 'How did President John F Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information while sharing little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only the 'easy' paraphrases were identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S_1, S_2) = \cos(S_1, S_2) - mismatch\_penalty \quad (4)

with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming formula – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:

s(S_1, S_2) = \max \begin{cases} s(S_1, S_2 - 1) - skip\_penalty \\ s(S_1 - 1, S_2) - skip\_penalty \\ s(S_1 - 1, S_2 - 1) + sim(S_1, S_2) \\ s(S_1 - 1, S_2 - 2) + sim(S_1, S_2) + sim(S_1, S_2 - 1) \\ s(S_1 - 2, S_2 - 1) + sim(S_1, S_2) + sim(S_1 - 1, S_2) \\ s(S_1 - 2, S_2 - 2) + sim(S_1, S_2 - 1) + sim(S_1 - 1, S_2) \end{cases} \quad (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
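A sketch of the recurrence in Equation (5), under the assumption that the pairwise sentence similarities have already been computed (for example with the cosine-minus-penalty score of Equation (4)) and are passed in as a matrix; the skip penalty value is an arbitrary illustration:

def align_score(sims, skip_penalty=0.1):
    # sims[i][j] holds sim(S1_i, S2_j) for sentence i of text 1 and j of text 2.
    n, m = len(sims), len(sims[0])
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                s[i][j - 1] - skip_penalty,
                s[i - 1][j] - skip_penalty,
                s[i - 1][j - 1] + sims[i - 1][j - 1],
            ]
            if j >= 2:   # one sentence of text 1 aligned to two of text 2
                candidates.append(s[i - 1][j - 2] + sims[i - 1][j - 1] + sims[i - 1][j - 2])
            if i >= 2:   # two sentences of text 1 aligned to one of text 2
                candidates.append(s[i - 2][j - 1] + sims[i - 1][j - 1] + sims[i - 2][j - 1])
            if i >= 2 and j >= 2:   # crossed two-to-two alignment
                candidates.append(s[i - 2][j - 2] + sims[i - 1][j - 2] + sims[i - 2][j - 1])
            s[i][j] = max(candidates)
    return s[n][m]

# Toy example: two sentences in each text, with their pairwise similarities.
print(align_score([[0.9, 0.1], [0.2, 0.8]]))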

Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, in which we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us of the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phased framework using the dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed, using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, in order to determine if we can deem the two sentences as paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a matching between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.

4 http://www.natcorp.ox.ac.uk

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)} \right) \quad (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxSim(w, T2), then apply the specificity via IDF, sum these values and normalize the result with the length of the text segment in question. The same process is applied to T2 and, in the end, we generate an average between T1 to T2 and vice versa. They offer us eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22] and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics.

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them;

– Latent Semantic Analysis (LSA) – uses a technique in algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrixes (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage;

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy;

– Lch – computes the similarity between two nodes by finding the path length between the two nodes in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up);

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share in common as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal';

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have a high Information Content;

– Lin – takes the Resnik measure and normalizes it, by using also the Information Content of the two nodes received as input;

– Jcn – uses the same information as Lin, but combined in a different way.
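A sketch of the aggregation in Equation (6), with the word-to-word similarity and the idf values passed in as plain functions/dictionaries; the word_sim stand-in below is a crude exact-match placeholder, whereas in practice any of the eight measures above would be plugged in, and the idf values are invented for illustration:

def text_similarity(t1, t2, word_sim, idf):
    # Equation (6): directional score from t1 to t2, averaged with t2 to t1.
    def directional(a, b):
        num = sum(max(word_sim(w, v) for v in b) * idf.get(w, 1.0) for w in a)
        den = sum(idf.get(w, 1.0) for w in a)
        return num / den if den else 0.0
    w1, w2 = t1.lower().split(), t2.lower().split()
    return 0.5 * (directional(w1, w2) + directional(w2, w1))

# Crude stand-in for a corpus- or knowledge-based word similarity measure.
def word_sim(w, v):
    return 1.0 if w == v else 0.0

idf = {"mogadishu": 3.2, "capital": 1.5, "somalia": 2.9}
print(text_similarity("Mogadishu is the capital of Somalia",
                      "the capital city of Somalia is Mogadishu",
                      word_sim, idf))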

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but also made between a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion;

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource;

– Cosine measure – by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning, and therefore proposed a semantic similarity approach, based on similarity information derived from WordNet and using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we have therefore two vectors \vec{a} and \vec{b}, and the similarity is calculated using the following formula:

sim(\vec{a}, \vec{b}) = \frac{\vec{a} \, W \, \vec{b}}{|\vec{a}| \, |\vec{b}|} \quad (7)

with W being a semantic similarity matrix containing the information about the similarity level of each word pair – in other words, a number between 0 and 1 for each W_{ij}, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea, Qiu or Zhang, and five with also a better precision rating than those three metrics.
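A sketch of Equation (7) with numpy, assuming the vocabulary and the word-pair similarity matrix W are given; the toy W below uses an invented similarity value, whereas in [8] the entries come from WordNet metrics:

import numpy as np

def sentence_vector(sentence, vocabulary):
    # Binary bag-of-words vector over a fixed vocabulary.
    words = set(sentence.lower().split())
    return np.array([1.0 if w in words else 0.0 for w in vocabulary])

def matrix_similarity(s1, s2, vocabulary, W):
    # Equation (7): sim(a, b) = (a . W . b) / (|a| |b|)
    a = sentence_vector(s1, vocabulary)
    b = sentence_vector(s2, vocabulary)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ W @ b) / denom if denom else 0.0

vocab = ["he", "was", "shot", "murdered", "killed"]
W = np.eye(len(vocab))
W[2, 3] = W[3, 2] = 0.8   # 'shot' ~ 'murdered' (illustrative value only)
print(matrix_similarity("he was shot", "he was murdered", vocab, W))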

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S_1, S_2) = Sim(S_1, S_2) / Diss(S_1, S_2) \quad (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies than the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.
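A minimal sketch of the final decision step of Equation (8), assuming the similarity and dissimilarity scores over dependency sets have already been computed elsewhere; the threshold value here is an arbitrary placeholder (in [23] it is learned from training data), and the length penalty mirrors the constant mentioned above:

def is_paraphrase(sim_score, diss_score, extra_dependencies=0,
                  threshold=1.0, length_penalty=14, max_extra=6):
    # One sentence adds much more information than the other:
    # inflate the dissimilarity, as described above.
    if extra_dependencies > max_extra:
        diss_score += length_penalty
    if diss_score == 0:
        return True
    # Equation (8): paraphrase score as the ratio of similarity to dissimilarity.
    return (sim_score / diss_score) > threshold

print(is_paraphrase(sim_score=8.0, diss_score=3.0))                         # True
print(is_paraphrase(sim_score=8.0, diss_score=3.0, extra_dependencies=9))   # False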

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrixes, and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques we also saw different types of corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.

5 http://www.inesc-id.pt

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask in this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state of the art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question, and are therefore most likely to contain a valid answer. The module is composed by a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get us closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of that article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, the latter with normalization, clustering and filtering of the candidate answers, in order to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
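A sketch of the kind of date normalization described above, turning a couple of common surface forms into the 'D.. M.. Y..' canonical representation; the patterns covered here are only illustrative and do not reproduce JustAsk's actual normalizer:

import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june",
     "july", "august", "september", "october", "november", "december"])}

def normalize_date(answer):
    # 'March 21, 1961' or '21 March 1961' -> 'D21 M3 Y1961'
    text = answer.lower().replace(",", "")
    m = re.match(r"(\w+) (\d{1,2}) (\d{4})$", text)      # March 21 1961
    if m and m.group(1) in MONTHS:
        return "D%s M%d Y%s" % (m.group(2), MONTHS[m.group(1)], m.group(3))
    m = re.match(r"(\d{1,2}) (\w+) (\d{4})$", text)      # 21 March 1961
    if m and m.group(2) in MONTHS:
        return "D%s M%d Y%s" % (m.group(1), MONTHS[m.group(2)], m.group(3))
    return answer   # leave other formats untouched

print(normalize_date("March 21, 1961"))   # D21 M3 Y1961
print(normalize_date("21 March 1961"))    # D21 M3 Y1961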

Similar instances are then grouped into a set of clusters and, for that purpose, distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 - \frac{|X \cap Y|}{\min(|X|, |Y|)} \quad (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As mentioned before, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own and then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
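A sketch of the overlap distance of Equation (9) together with the single-link clustering loop just described; the scoring and representative choice follow the description above, but the details of JustAsk's real implementation may differ:

def overlap_distance(x, y):
    # Equation (9): 0.0 when one candidate is contained in the other.
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(tx & ty) / min(len(tx), len(ty))

def single_link_clustering(answers, distance, threshold=0.33):
    # Start with one cluster per candidate answer and repeatedly merge the two
    # closest clusters while their (single-link) distance is below the threshold.
    clusters = [[a] for a in answers]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None or best[0] > threshold:
            break
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

answers = ["John F Kennedy", "John Kennedy", "Kennedy", "Lyndon Johnson"]
for cluster in single_link_clustering(answers, overlap_distance):
    representative = max(cluster, key=len)   # the cluster's longest answer
    print(representative, cluster)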

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = \frac{\text{Correctly judged items}}{\text{Relevant (non-null) items retrieved}} \quad (10)

Recall:

Recall = \frac{\text{Relevant (non-null) items retrieved}}{\text{Items in test corpus}} \quad (11)

F-measure (F1):

F = \frac{2 \cdot precision \cdot recall}{precision + recall} \quad (12)

Accuracy:

Accuracy = \frac{\text{Correctly judged items}}{\text{Items in test corpus}} \quad (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i} \quad (14)
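A small sketch of the evaluation measures (10)–(14), under the assumption that each system output is a ranked list of answers judged correct (True) or wrong (False), with an empty list meaning that no answer was returned:

def evaluate(results, total_questions):
    # 'results' maps a question id to its ranked list of judged answers.
    answered = [r for r in results.values() if r]
    correct = sum(1 for r in answered if r[0])              # top answer correct
    precision = correct / len(answered) if answered else 0.0          # (10)
    recall = len(answered) / total_questions                          # (11)
    f1 = (2 * precision * recall / (precision + recall)               # (12)
          if precision + recall else 0.0)
    accuracy = correct / total_questions                              # (13)
    # (14) Mean Reciprocal Rank over the questions in the test corpus.
    reciprocal_ranks = []
    for r in results.values():
        rank = next((i + 1 for i, ok in enumerate(r) if ok), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    mrr = sum(reciprocal_ranks) / total_questions
    return precision, recall, f1, accuracy, mrr

results = {1: [True, False], 2: [False, True, False], 3: []}
print(evaluate(results, total_questions=3))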

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input, and then we see if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of the references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages that we found at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 x 200 entries) and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:

1. Question=What is the capital of Somalia?

URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing at most 64 snippets per question. The script runs through the list of questions one by one and, for each one of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...', and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one to find and write down correct answers, instead of the completely manual approach we followed before writing the application. In this current version, the script prints the status of the reference list to an output file as it is being filled, question by question, by the final snippet of each question. Figure 2 illustrates the process,


while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line
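A rough sketch of the interactive loop described above and shown in Figures 2 and 3; the file formats, field names and function names below are simplified assumptions for illustration, not the exact ones used by the application:

import json, sys

def collect_references(questions_file, snippets_file, output_file):
    # questions_file: one question per line; snippets_file: JSON mapping each
    # question to a list of snippet contents (at most 64 per question).
    with open(questions_file) as f:
        questions = [line.strip() for line in f if line.strip()]
    with open(snippets_file) as f:
        snippets = json.load(f)
    references = {q: [] for q in questions}
    for q in questions:
        for content in snippets.get(q, []):
            print("\nQ:", q)
            print("Content:", content)
            answer = input("reference (empty to skip)> ").strip()
            # Only accept input that actually occurs in the snippet content.
            if answer and answer.lower() in content.lower():
                references[q].append(answer)
            # Dump the current state of the reference list after every snippet.
            with open(output_file, "w") as out:
                json.dump(references, out, indent=2)

if __name__ == "__main__":
    collect_references(sys.argv[1], sys.argv[2], sys.argv[3])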


3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we kept only those for which we knew the New York Times had a correct answer; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed by a set of questions and also by a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what is the 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference. We simply do not know, for now, who 'he' or 'she' is. It makes us question: 'How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?'

The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st, the 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and therefore was composed by a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution would be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced it to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, and in as many variations, as in Google.

We also questioned whether the big problem is within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
   200        88        87        25

Accuracy   Precision   Recall   MRR    Precision1
  44%        50.3%      87.5%   0.37     35.4%

Table 1: JustAsk baseline best results for the corpus test of 200 questions

In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.

Extracted        Google – 8 passages               Google – 32 passages
Answers          Questions  CAE      AS            Questions  CAE      AS
                 (#)        success  success       (#)        success  success
0                21         0        –             19         0        –
1 to 10          73         62       53            16         11       11
11 to 20         32         29       22            13         9        7
21 to 40         11         10       8             61         57       39
41 to 60         1          1        1             30         25       17
61 to 100        0          –        –             18         14       11
> 100            0          –        –             5          4        3
All              138        102      84            162        120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no


Extracted        Google – 64 passages
Answers          Questions (#)  CAE success  AS success
0                17             0            0
1 to 10          18             9            9
11 to 20         2              2            2
21 to 40         16             12           8
41 to 60         42             38           25
61 to 100        38             32           18
> 100            35             30           17
All              168            123          79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

                    Google – 8 passages    Google – 32 passages   Google – 64 passages
Category   Total    Q    CAE   AS          Q    CAE   AS          Q    CAE   AS
abb        4        3    1     1           3    1     1           4    1     1
des        9        6    4     4           6    4     4           6    4     4
ent        21       15   1     1           16   1     0           17   1     0
hum        64       46   42    34          55   48    36          56   48    31
loc        39       27   24    19          34   27    22          35   29    21
num        63       41   30    25          48   35    25          50   40    22
All        200      138  102   84          162  120   88          168  123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and


was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall

scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions  232 questions  1210 questions
Precision   26.7%           28.2%          27.4%
Recall      77.2%           67.2%          76.9%
Accuracy    20.6%           19.0%          21.1%

Table 5: Baseline results for the three corpus sets created using TREC questions and the New York Times corpus.

Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages   20      30      50      100
Accuracy   21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, this setting also takes more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still presents over its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted, '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals the following:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles'


– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls – Okavango River – Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Oberbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the line for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such


as Soundex can solve more specific cases, such as the one described above. The question remains: when and where do we apply it exactly?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to include these acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy that we can access through WordNet. Moreover, we may want to


analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

Once this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to – and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and give a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not change at all for any of the corpora.


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
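To make this architecture more concrete, below is a minimal Java sketch of what such a scoring interface could look like. The class names, fields and the demo are illustrative assumptions and not the actual JustAsk code; only the weights (1.0 for frequency/equivalence, 0.5 for inclusion) come from the experiment described above.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified view of a candidate answer and the relations it shares with others.
class Candidate {
    String text;
    double score = 0.0;
    List<Candidate> equivalences = new ArrayList<>();
    List<Candidate> inclusions = new ArrayList<>();
    Candidate(String text) { this.text = text; }
}

// Each scorer adds its own contribution to a candidate's final score.
interface Scorer {
    double score(Candidate candidate, List<Candidate> allCandidates);
}

// Baseline scorer: one point for every exact (case-insensitive) repetition of the answer.
class BaselineScorer implements Scorer {
    public double score(Candidate c, List<Candidate> all) {
        return all.stream().filter(o -> o.text.equalsIgnoreCase(c.text)).count();
    }
}

// Rules scorer: 1.0 per equivalence and 0.5 per inclusion, as in the first experiment.
class RulesScorer implements Scorer {
    public double score(Candidate c, List<Candidate> all) {
        return 1.0 * c.equivalences.size() + 0.5 * c.inclusions.size();
    }
}

class RelationsModuleDemo {
    public static void main(String[] args) {
        Candidate a = new Candidate("John F. Kennedy");
        Candidate b = new Candidate("John Fitzgerald Kennedy");
        a.equivalences.add(b);
        b.equivalences.add(a);
        List<Candidate> all = Arrays.asList(a, b);
        List<Scorer> scorers = Arrays.asList(new BaselineScorer(), new RulesScorer());
        for (Scorer s : scorers) {
            a.score += s.score(a, all);
        }
        System.out.println(a.text + " -> " + a.score); // 2.0 = 1.0 (frequency) + 1.0 (equivalence)
    }
}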


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
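A minimal sketch of such a conversion normalizer is given below. The regular expressions, the subset of units covered and the class name are assumptions made for the example; the conversion factors themselves are standard.

import java.util.function.DoubleUnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Converts metric units found in a candidate answer into their imperial
// counterparts, so that answers such as '300 km' and '186 miles' become comparable.
class UnitNormalizer {

    private static final Pattern KM = Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*(?:km|kilometers?|kilometres?)");
    private static final Pattern KG = Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*(?:kg|kilograms?)");
    private static final Pattern CELSIUS = Pattern.compile("(-?\\d+(?:\\.\\d+)?)\\s*(?:degrees? Celsius|Celsius)");

    static String normalize(String answer) {
        answer = convert(KM, answer, v -> v * 0.621371, "miles");
        answer = convert(KG, answer, v -> v * 2.20462, "pounds");
        answer = convert(CELSIUS, answer, v -> v * 9 / 5 + 32, "Fahrenheit");
        return answer;
    }

    private static String convert(Pattern unit, String text, DoubleUnaryOperator conversion, String targetUnit) {
        Matcher m = unit.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            double converted = conversion.applyAsDouble(Double.parseDouble(m.group(1)));
            m.appendReplacement(sb, String.format("%.0f %s", converted, targetUnit));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("about 300 km long"));  // about 186 miles long
        System.out.println(normalize("75 kilograms"));       // 165 pounds
        System.out.println(normalize("25 degrees Celsius")); // 77 Fahrenheit
    }
}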

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not underestimate, under any circumstances, the richness of a natural language.
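A possible sketch of this abbreviation normalizer, with the 'leading token only' restriction and a small exception list, is shown below; the exact exception handling and title spellings in JustAsk may differ.

import java.util.Arrays;
import java.util.List;

// Strips honorifics and titles from candidate answers, with a few guards against
// removing tokens that are actually part of the entity name (e.g. 'Mr. T', 'Queen' the band).
class AbbreviationNormalizer {

    private static final List<String> TITLES =
            Arrays.asList("Mr", "Mrs", "Ms", "King", "Queen", "Dame", "Sir", "Dr");
    // Answers that must be left untouched (illustrative exception list).
    private static final List<String> EXCEPTIONS = Arrays.asList("Mr. T", "Queen");

    static String normalize(String answer) {
        String trimmed = answer.trim();
        if (EXCEPTIONS.contains(trimmed)) {
            return trimmed;
        }
        String[] tokens = trimmed.split("\\s+");
        // Compare the first token without a trailing period, so 'Mr.' and 'Mr' both match.
        String head = tokens[0].endsWith(".") ? tokens[0].substring(0, tokens[0].length() - 1) : tokens[0];
        // Only drop a leading title, and only if something else remains afterwards.
        if (tokens.length > 1 && TITLES.contains(head)) {
            return String.join(" ", Arrays.copyOfRange(tokens, 1, tokens.length));
        }
        return trimmed;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Mr. Kennedy"));      // Kennedy
        System.out.println(normalize("Queen"));            // Queen (the band, kept intact)
        System.out.println(normalize("Mr. King"));         // King
        System.out.println(normalize("Queen Elizabeth"));  // Elizabeth
    }
}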

A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, allowing room for further extensions.

4.2.2 Proximity Measure

As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
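The sketch below illustrates this proximity boost, assuming character positions inside a single passage and the 20-character/5-point values given above; how the boost is fed back into the real scoring pipeline is omitted.

import java.util.Arrays;
import java.util.List;

// Boosts an answer's score when, inside the passage it was extracted from,
// it lies close (on average) to the query keywords.
class ProximityScorer {

    private static final int DISTANCE_THRESHOLD = 20; // characters
    private static final double BOOST = 5.0;          // score points

    static double proximityBoost(String passage, String answer, List<String> keywords) {
        int answerPosition = passage.indexOf(answer);
        if (answerPosition < 0) {
            return 0.0;
        }
        double totalDistance = 0;
        int keywordsFound = 0;
        for (String keyword : keywords) {
            int keywordPosition = passage.indexOf(keyword);
            if (keywordPosition >= 0) {
                totalDistance += Math.abs(answerPosition - keywordPosition);
                keywordsFound++;
            }
        }
        if (keywordsFound == 0) {
            return 0.0;
        }
        double averageDistance = totalDistance / keywordsFound;
        return averageDistance <= DISTANCE_THRESHOLD ? BOOST : 0.0;
    }

    public static void main(String[] args) {
        String passage = "The capital of Iraq is Baghdad.";
        List<String> keywords = Arrays.asList("capital", "Iraq");
        System.out.println(proximityBoost(passage, "Baghdad", keywords)); // 5.0
    }
}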


4.3 Results

                   200 questions                 1210 questions
Before features    P 50.6%  R 87.0%  A 44.0%     P 28.0%  R 79.5%  A 22.2%
After features     P 51.1%  R 87.0%  A 44.5%     P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references we gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in that case a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it simply advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.
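For illustration, the core loop of such an annotation script could be sketched as follows; the input format (one tab-separated line per question, with the question identifier followed by its references) and the class name are assumptions made for the example.

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Scanner;

// Minimal interactive loop for annotating relations between pairs of references.
class RelationAnnotator {

    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Scanner in = new Scanner(System.in);
        try (PrintWriter out = new PrintWriter(args[1])) {
            for (String line : lines) {
                String[] fields = line.split("\t");
                // A question with a single reference has no pairs and is simply skipped.
                for (int i = 1; i < fields.length; i++) {
                    for (int j = i + 1; j < fields.length; j++) {
                        String judgment = ask(in, fields[i], fields[j]);
                        if (!judgment.equals("SKIP")) {
                            out.println("QUESTION " + fields[0] + " " + label(judgment)
                                    + " - " + fields[i] + " AND " + fields[j]);
                        }
                    }
                }
            }
        }
    }

    // Keeps asking until one of the accepted options is typed.
    private static String ask(Scanner in, String a, String b) {
        while (true) {
            System.out.println("'" + a + "' vs '" + b + "'  [E/I1/I2/A/C/D/skip]?");
            String answer = in.nextLine().trim().toUpperCase();
            if (answer.matches("E|I1|I2|A|C|D|SKIP")) {
                return answer;
            }
            System.out.println("Invalid option, please repeat the judgment.");
        }
    }

    private static String label(String code) {
        switch (code) {
            case "E":  return "EQUIVALENCE";
            case "I1":
            case "I2": return "INCLUSION";
            case "A":  return "COMPLEMENTARY ANSWERS";
            case "C":  return "CONTRADICTORY ANSWERS";
            default:   return "IN DOUBT";
        }
    }
}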

Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with the potential to improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of characters in common by the sum of the total number of characters of the two sequences.

– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven elements.

44

– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l × p × (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 used by Levenshtein):

D(i, j) = min{ D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }   (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.


Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much they affect the performance of each metric.
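To make the bigram-based measures concrete, here is a small self-contained sketch that reproduces the 'Kennedy' vs. 'Keneddy' computation above; it does not use the SimMetrics library itself, and the class name is illustrative.

import java.util.HashSet;
import java.util.Set;

// Character-bigram versions of the Dice coefficient and the Jaccard index.
class BigramSimilarity {

    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    static double dice(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> intersection = new HashSet<>(x);
        intersection.retainAll(y);
        return 2.0 * intersection.size() / (x.size() + y.size());
    }

    static double jaccard(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> intersection = new HashSet<>(x);
        intersection.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));    // 0.833... = (2 x 5) / (6 + 6)
        System.out.println(jaccard("Kennedy", "Keneddy")); // 0.714... = 5 / 7
    }
}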

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually inexistent in most cases.


Metric / Threshold     0.17                        0.33                        0.5
Levenshtein            P 51.1  R 87.0  A 44.5      P 43.7  R 87.5  A 38.5      P 37.3  R 84.5  A 31.5
Overlap                P 37.9  R 87.0  A 33.0      P 33.0  R 87.0  A 33.0      P 36.4  R 86.5  A 31.5
Jaccard index          P 9.8   R 56.0  A 5.5       P 9.9   R 55.5  A 5.5       P 9.9   R 55.5  A 5.5
Dice coefficient       P 9.8   R 56.0  A 5.5       P 9.9   R 56.0  A 5.5       P 9.9   R 55.5  A 5.5
Jaro                   P 11.0  R 63.5  A 7.0       P 11.9  R 63.0  A 7.5       P 9.8   R 56.0  A 5.5
Jaro-Winkler           P 11.0  R 63.5  A 7.0       P 11.9  R 63.0  A 7.5       P 9.8   R 56.0  A 5.5
Needleman-Wunsch       P 43.7  R 87.0  A 38.0      P 43.7  R 87.0  A 38.0      P 16.9  R 59.0  A 10.0
LogSim                 P 43.7  R 87.0  A 38.0      P 43.7  R 87.0  A 38.0      P 42.0  R 87.0  A 36.5
Soundex                P 35.9  R 83.5  A 30.0      P 35.9  R 83.5  A 30.0      P 28.7  R 82.0  A 23.5
Smith-Waterman         P 17.3  R 63.5  A 11.0      P 13.4  R 59.5  A 8.0       P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results, in %, for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold (Levenshtein with threshold 0.17).

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and, therefore, will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in


order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but then adds an extra weight for strings that not only share at least one perfect match but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
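A direct, token-based reading of equations (20)-(22) can be sketched as follows; note that token counts are assumed for length(a1) and length(a2), the tokenization is simplified, and the class name is illustrative rather than the actual JustAsk implementation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the distance in equations (20)-(22): common tokens count fully, and
// non-common tokens sharing their first letter with a token of the other answer count half.
class JFKDistanceSketch {

    static double distance(String answer1, String answer2) {
        List<String> tokens1 = Arrays.asList(answer1.split("\\s+"));
        List<String> tokens2 = Arrays.asList(answer2.split("\\s+"));

        Set<String> common = new HashSet<>(tokens1);
        common.retainAll(new HashSet<>(tokens2));

        // Tokens of each answer left unmatched after removing the common terms.
        List<String> rest1 = new ArrayList<>(tokens1);
        rest1.removeAll(common);
        List<String> rest2 = new ArrayList<>(tokens2);
        rest2.removeAll(common);

        int firstLetterMatches = 0;
        for (String t1 : rest1) {
            for (String t2 : rest2) {
                if (Character.toLowerCase(t1.charAt(0)) == Character.toLowerCase(t2.charAt(0))) {
                    firstLetterMatches++;
                    break;
                }
            }
        }
        int nonCommonTerms = rest1.size() + rest2.size();

        double commonTermsScore =
                (double) common.size() / Math.max(tokens1.size(), tokens2.size());
        double firstLetterMatchesScore =
                nonCommonTerms == 0 ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;

        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    public static void main(String[] args) {
        // 1 - 1/3 (common 'Kennedy') - (2/4)/2 (initial matches J/John and F/Fitzgerald) = ~0.42
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}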

A few flaws of this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                   200 questions                 1210 questions
Before             P 51.1%  R 87.0%  A 44.5%     P 28.2%  R 79.8%  A 22.5%
After new metric   P 51.7%  R 87.0%  A 45.0%     P 29.2%  R 79.8%  A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:

if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect much the analysis we are


looking for), here are the results for every metric incorporated, by the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric / Category     Human    Location    Numeric
Levenshtein           50.00    53.85       4.84
Overlap               42.42    41.03       4.84
Jaccard index         7.58     0.00        0.00
Dice                  7.58     0.00        0.00
Jaro                  9.09     0.00        0.00
Jaro-Winkler          9.09     0.00        0.00
Needleman-Wunsch      42.42    46.15       4.84
LogSim                42.42    46.15       4.84
Soundex               42.42    46.15       3.23
Smith-Waterman        9.09     2.56        0.00
JFKDistance           54.55    53.85       4.84

Table 10: Accuracy results (in %) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results are always 'proportional' in each category to how they perform overall. That stops us from picking the best metric for each category, since we found out that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single, straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers fail easily, due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as


a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written in words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error detected lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved

– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning, 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40:511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4 845-848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations, 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering, 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems, 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)], 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering, 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation, LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the j-th token of candidate answer Ai

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is the hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001



1.2 Questions of category Numeric

Given

• number = (d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX * max

• numXmin = numX * min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "()bnumber( )") and matchesic(Aj, "()bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bmorethan number( )") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()blessthan number( )") and (numAi > numAj)
  – Example: less than 300 includes 200

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr. Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to F. Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr. Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "sins+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)s+(in|of)s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)s+Aj,0")
  – Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey
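As an illustration, one of the Numeric equivalence rules above (two numeric answers whose values fall within the max/min factors of each other) could be implemented roughly as follows; the number pattern and the class name are simplifications made for the example.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the Numeric equivalence rule: two numeric answers are considered
// equivalent when their values are within roughly 10% of each other (max = 1.1, min = 0.9).
class NumericEquivalenceRule {

    private static final Pattern NUMBER = Pattern.compile("\\d+(?:\\.\\d+)?");
    private static final double MAX = 1.1;
    private static final double MIN = 0.9;

    static boolean equivalent(String a1, String a2) {
        Double n1 = extractNumber(a1);
        Double n2 = extractNumber(a2);
        if (n1 == null || n2 == null) {
            return false;
        }
        return n1 * MAX > n2 && n1 * MIN < n2;
    }

    private static Double extractNumber(String answer) {
        Matcher m = NUMBER.matcher(answer);
        return m.find() ? Double.parseDouble(m.group()) : null;
    }

    public static void main(String[] args) {
        System.out.println(equivalent("210 people", "200 people")); // true
        System.out.println(equivalent("300 km", "1000 miles"));     // false
    }
}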

Page 14: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

2 Related Work

As previously stated the focus of this work is to find a better formula tocatch relations of equivalence from notational variances to paraphrases Firstwe identify the relations between answers that will be the target of our studyAnd since paraphrases are such a particular type of equivalences (and yet withthe potential to encompass all types of equivalences if we extend the conceptenough) we then talk about the different techniques to detect paraphrases de-scribed in the literature

21 Relating Answers

As we said in Section 1 QA systems usually base the process of recoveringanswers on the principle of redundancy ie every information item is likely tohave been stated multiple times in multiple ways in multiple documents [22]But this strategy only works if it has a methodology of relating answers behindand that is still a major task in QA

Our main goal here is to learn how a pair of answers are related so thatour system can retrieve a more correct final answer based on frequency countand the number and type of relationships between pairs present in the relevantpassages

From the terminology first proposed by Webber and his team [44] and laterdescribed by Moriceau [29] we can define four different types of relationshipsbetween answers

Equivalence ndash if answers are consistent and entail mutually whether nota-tional variations (lsquoJohn F Kennedyrsquo and lsquoJohn Kennedyrsquo or lsquo24 October 1993rsquoand lsquoOctober 23rd 1993rsquo) or paraphrases (from synonyms and rephrases suchas lsquoHe was shot and diedrsquo vs lsquoHe was murderedrsquo to inferences such as lsquoJudyGarland was Liza Minellirsquos motherrsquo and lsquoLiza Minelli was Judy Garlandrsquos sisterrsquo)

Inclusion – if answers are consistent and differ in specificity, one entailing the other. They can be of hypernymy (the question 'What kind of animal is Simba?' can be answered with 'lion' or 'mammal') or meronymy (for the question 'Where is the famous La Scala opera house?', 'Milan', 'Italy' and 'Europe' are all possible answers).

Alternative – if answers are not entailing at all, they constitute valid complementary answers, whether about the same entity ('molten rock' and 'predominantly 8 elements (Si, O, Al, Mg, Na, K, Ca, Fe)' for the question 'What does magma consist of?') or representing different entities (all the answers to the question 'Name a German philosopher').

Contradictory – if answers are inconsistent and their conjunction is invalid. For instance, the answers '1430 km' and '1609 km' for the question 'How long is the Okavango River?' are contradictory.

The two measures that will be mentioned in Section 3.1 to cluster answers are still insufficient for a good clustering at JustAsk. For some answers, like 'George Bush' and 'George W. Bush', we may need semantic information, since we may be dealing with two different individuals.

2.2 Paraphrase Identification

According to WordReference.com 1, a paraphrase can be defined as:

– (noun) rewording for the purpose of clarification;

– (verb) express the same message in different words.

There are other multiple definitions (other paraphrases of the definition of the word 'paraphrase'). In the context of our work, we define paraphrases as expressing the same idea in two different text fragments in a symmetric relationship, in which one entails the other – meaning that a potential addition of information, with only one fragment entailing the other, should be discarded as such.

Here is an example of two sentences which should be deemed as paraphrases, taken from the Microsoft Research Paraphrase Corpus 2, from now on mentioned as the MSR Corpus:

Sentence 1: Amrozi accused his brother, whom he called 'the witness', of deliberately distorting his evidence.
Sentence 2: Referring to him as only 'the witness', Amrozi accused his brother of deliberately distorting his evidence.

In the following pages we will present an overview of some of the approaches and techniques to paraphrase identification related to natural language, specially after the release of the MSR Corpus.

2.2.1 SimFinder – A methodology based on linguistic analysis of comparable corpora

We can view two comparable texts as collections of several potential paraphrases. SimFinder [9] was a tool made for summarization, by a team led by Vasileios Hatzivassiloglou, that allowed to cluster highly related textual units. In 1999, they defined a more restricted concept of similarity in order to reduce ambiguities: two text units are similar if 'they share the same focus on a common concept, actor, object, or action'. In addition to that, the common object/actor must either be the subject of the same description or must perform (or be subjected to) the same action. The linguistic features implemented can be grouped into two types:

1 http://www.wordreference.com/definition/paraphrase
2 The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed of 5800 pairs of sentences which have been extracted from news sources over the web, with manual (human) annotations indicating whether a pair is a paraphrase or not. It is used in a considerable amount of work in Paraphrase Identification.


– Primitive Features – encompassing several ways to define a match, such as word co-occurrence (sharing a single word in common), matching noun phrases, shared proper nouns, WordNet [28] synonyms, and matching verbs with the same semantic class. These matches not only consider identical words, but also words that match on their stem – the most basic form of a word, to which affixes are attached.

– Composite Features – combinations of pairs of primitive features to improve clustering, analyzing whether two joint primitives occur in the same order within a certain distance, or matching joint primitives based on the primitives' type – for instance, matching a noun phrase and a verb together.

The system showed better results than TF-IDF [37], the classic information retrieval method, already in its first version. Some improvements were made later on, with a larger team that included Regina Barzilay, a very prominent researcher in the field of Paraphrase Identification, namely the introduction of a quantitative evaluation measure.

2.2.2 Methodologies purely based on lexical similarity

In 2003, Barzilay and Lee [3] used the simple N-gram overlap measure to produce clusters of paraphrases. The N-gram overlap is used to count how many N-grams overlap in two sentences (usually with N ≤ 4) – in other words, how many common sequences of words we can find between two sentences. The following formula gives us the measure of the Word Simple N-gram Overlap:

NGsim(S_1, S_2) = \frac{1}{N} \sum_{n=1}^{N} \frac{Count_{match}(n\text{-gram})}{Count(n\text{-gram})} \qquad (1)

in which Count_match(n-gram) counts how many n-grams overlap for a given sentence pair, and Count(n-gram) represents the maximum number of n-grams in the shorter sentence.

A year later, Dolan and his team [7] used the Levenshtein distance [19] to extract sentence-level paraphrases from parallel news sources, and compared it with a heuristic strategy that paired initial sentences from different news stories. The Levenshtein metric calculates the minimum number of edit operations needed to transform one string into another. Results showed that the Levenshtein data was cleaner and more easily aligned, but the Alignment Error Rate (AER) – an objective scoring function to measure how well an algorithm can align words in a corpus, combining precision and recall metrics, described in [31] – showed that the two strategies performed similarly, each one with strengths and weaknesses: a string edit distance function like Levenshtein achieved high precision, but involved inadequate assumptions about the data, while the heuristic allowed to extract richer, but also less parallel, data.


In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the word N-gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance measuring how many words overlap (see Section 3.1.3) are, ironically, the two metrics that the actual answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore some other options. As an alternative to these metrics, a new measure called LogSim was proposed by Cordeiro's team. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

L(x, \lambda) = -\log_2\left(\frac{\lambda}{x}\right) \qquad (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentence pairs that are too similar). The complete formula is written below:

logSim(S_1, S_2) = \begin{cases} L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\ e^{-k \cdot L(x, \lambda)} & \text{otherwise} \end{cases} \qquad (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e., the number of words in the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.
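A small sketch of the basic two-parameter LogSim follows, under the assumption that links are counted as distinct words occurring in both sentences; note that, by design, near-identical pairs also score low.

import math

def logsim(s1, s2, k=2.7):
    # Exclusive lexical links: distinct words shared by the two sentences.
    words1, words2 = set(s1.lower().split()), set(s2.lower().split())
    links = len(words1 & words2)
    x = max(len(s1.split()), len(s2.split()))   # words of the longest sentence
    if links == 0:
        return 0.0
    l = -math.log2(links / x)                   # L(x, lambda) of formula (2)
    # First branch for pairs with enough links, exponential penalty otherwise.
    return l if l < 1.0 else math.exp(-k * l)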

L(x, y, λ) is computed using weighting factors α and β (β being 1−α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus 3, and extra negative pairs (of non-paraphrases), to compensate for the lack of negative pairs offered by the two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.


2.2.3 Canonicalization as complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution, converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features, such as Edit Distance and N-gram overlap. This canonicalization involves particularly number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach, with simple lexical matching features, led to better results than other more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive to active voice and future tense transformations can, however, matter to more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases, in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question: two sentences sharing most of their words are likely to be paraphrases, but these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only the 'easy' paraphrases were identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S_1, S_2) = \cos(S_1, S_2) - mismatch\_penalty \qquad (4)


with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming formula – a way of breaking down a complex problem into a series of simpler subproblems:

s(S_1, S_2) = \max \begin{cases} s(S_1, S_2 - 1) - skip\_penalty \\ s(S_1 - 1, S_2) - skip\_penalty \\ s(S_1 - 1, S_2 - 1) + sim(S_1, S_2) \\ s(S_1 - 1, S_2 - 2) + sim(S_1, S_2) + sim(S_1, S_2 - 1) \\ s(S_1 - 2, S_2 - 1) + sim(S_1, S_2) + sim(S_1 - 1, S_2) \\ s(S_1 - 2, S_2 - 2) + sim(S_1, S_2 - 1) + sim(S_1 - 1, S_2) \end{cases} \qquad (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
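The recurrence can be sketched as follows over two lists of sentences, with sim left as a pluggable function (for instance, the cosine minus the mismatch penalty of formula (4)); the names and default penalty are illustrative, not taken from the original paper.

def alignment_score(para1, para2, sim, skip_penalty=0.2):
    # para1, para2: lists of sentences; sim: local similarity of two sentences.
    n, m = len(para1), len(para2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                s[i][j - 1] - skip_penalty,                        # skip a sentence of para2
                s[i - 1][j] - skip_penalty,                        # skip a sentence of para1
                s[i - 1][j - 1] + sim(para1[i - 1], para2[j - 1])  # one-to-one alignment
            ]
            if j >= 2:   # one sentence of para1 aligned with two of para2
                options.append(s[i - 1][j - 2] + sim(para1[i - 1], para2[j - 1]) + sim(para1[i - 1], para2[j - 2]))
            if i >= 2:   # two sentences of para1 aligned with one of para2
                options.append(s[i - 2][j - 1] + sim(para1[i - 1], para2[j - 1]) + sim(para1[i - 2], para2[j - 1]))
            if i >= 2 and j >= 2:   # crossed two-to-two alignment
                options.append(s[i - 2][j - 2] + sim(para1[i - 1], para2[j - 2]) + sim(para1[i - 2], para2[j - 1]))
            s[i][j] = max(options)
    return s[n][m]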

Results were interesting: a 'weak' similarity measure, combined with contextual information, managed to outperform more sophisticated methods, such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phased framework using the dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed, using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, to determine if we can deem the two sentences as paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus 4, a collection of 100 million words that gathers both spoken-language and written samples.

4 http://www.natcorp.ox.ac.uk

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)} \right) \qquad (6)

For each word w in the segment T1, the goal is to identify the word that matches the most in T2, given by maxSim(w, T2), and then apply the specificity via IDF, sum these contributions and normalize them over the text segment in question. The same process is applied to T2 and, in the end, we generate an average between the T1-to-T2 and T2-to-T1 directions. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22] and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics; a short sketch of the aggregation step of formula (6) follows the list.

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word was used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.

– Lin – takes the Resnik measure and normalizes it, by also using the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.
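As announced above, here is a small sketch of the aggregation step of formula (6); maxsim and idf are left as pluggable functions, since the authors allow eight different choices for the former, and the names below are ours.

def text_similarity(t1, t2, maxsim, idf):
    # t1, t2: lists of words; maxsim(w, t): best word-to-word similarity of w
    # against the words of segment t; idf(w): inverse document frequency of w.
    def directional(source, target):
        num = sum(maxsim(w, target) * idf(w) for w in source)
        den = sum(idf(w) for w in source)
        return num / den if den else 0.0
    # Average of the two directions, as in formula (6).
    return 0.5 * (directional(t1, t2) + directional(t2, t1))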

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Recognizing Textual Entailment (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such an expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.

– Cosine measure – by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning. They therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, \vec{a} and \vec{b}, and the similarity is calculated using the following formula:

sim(\vec{a}, \vec{b}) = \frac{\vec{a}\, W\, \vec{b}}{|\vec{a}|\, |\vec{b}|} \qquad (7)


with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea, Qiu or Zhang, and five with a better precision rating than those three approaches.
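A minimal sketch of formula (7), assuming a vocabulary built from both sentences and a generic word-to-word similarity function standing in for the WordNet metrics used to fill W:

import numpy as np

def matrix_similarity(tokens1, tokens2, word_sim):
    # Common vocabulary and the two binary occurrence vectors.
    vocab = sorted(set(tokens1) | set(tokens2))
    a = np.array([1.0 if w in tokens1 else 0.0 for w in vocab])
    b = np.array([1.0 if w in tokens2 else 0.0 for w in vocab])
    # W[i][j] holds the similarity of vocabulary words i and j (1.0 on the diagonal).
    W = np.array([[1.0 if wi == wj else word_sim(wi, wj) for wj in vocab] for wi in vocab])
    return float(a @ W @ b) / (np.linalg.norm(a) * np.linalg.norm(b))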

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S_1, S_2) = \frac{Sim(S_1, S_2)}{Diss(S_1, S_2)} \qquad (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared with the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described above, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques, we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora, using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID 5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.

5 http://www.inesc-id.pt

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and the identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state of the art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question, and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions, such as 'When was Mozart born?', use the Google or Yahoo search engines, while definition questions, like 'What is a parrot?', use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with the normalization, clustering and filtering of the candidate answers in order to return a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 - \frac{|X \cap Y|}{\min(|X|, |Y|)} \qquad (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned earlier, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own and then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer in it. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will, in many cases, be directly influenced by its length.
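A simplified sketch of this clustering step is given below, using the overlap distance of formula (9) and a single-link merge under a threshold; the real module also supports the Levenshtein distance and candidate scores other than 1.0.

def overlap_distance(x, y):
    tokens_x, tokens_y = set(x.lower().split()), set(y.lower().split())
    return 1.0 - len(tokens_x & tokens_y) / min(len(tokens_x), len(tokens_y))

def single_link_clusters(answers, distance=overlap_distance, threshold=0.5):
    clusters = [[a] for a in answers]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link: the distance between two clusters is the minimum
                # of the distances between any of their members.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    # Representative: the longest answer; score: one point per member.
    return [(max(c, key=len), float(len(c))) for c in clusters]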

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = \frac{\text{Correctly judged items}}{\text{Relevant (non-null) items retrieved}} \qquad (10)

Recall:

Recall = \frac{\text{Relevant (non-null) items retrieved}}{\text{Items in test corpus}} \qquad (11)

F-measure (F1):

F = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (12)

Accuracy:

Accuracy = \frac{\text{Correctly judged items}}{\text{Items in test corpus}} \qquad (13)

Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i} \qquad (14)
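A small sketch of how these measures can be computed, assuming each system output is marked as correct, wrong or unanswered, and that the rank of the first correct answer is available for the MRR:

def evaluate(judgements, ranks):
    # judgements: one of 'correct', 'wrong' or 'no_answer' per question.
    # ranks: rank of the first correct answer per question (None when absent).
    total = len(judgements)
    answered = sum(1 for j in judgements if j != 'no_answer')
    correct = sum(1 for j in judgements if j == 'correct')
    precision = correct / answered if answered else 0.0
    recall = answered / total if total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / total if total else 0.0
    mrr = sum(1.0 / r for r in ranks if r) / total if total else 0.0
    return {'precision': precision, 'recall': recall,
            'f1': f1, 'accuracy': accuracy, 'mrr': mrr}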

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input, and then seeing if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions, covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references to the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of the references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or were even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages found over at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 x 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing the corpus, with a stipulated maximum of 64 snippets per question. The script runs the list of questions one by one and, for each one of them, presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one finding and writing down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.


3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions, from which we filtered the questions we knew the New York Times had a correct answer to; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference. We simply do not know, for now, who 'he' or 'she' is. It makes us question: how much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?

In this particular case, the original reference, 'Sundance Festival', is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and therefore was composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution should be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and by fixing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, and in as many variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/sentence, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we had hoped for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

The results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of them correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one correct final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.

Extracted          Google – 8 passages           Google – 32 passages
Answers (#)     Questions   CAE    AS         Questions   CAE    AS
0                   21        0     –             19        0     –
1 to 10             73       62    53             16       11    11
11 to 20            32       29    22             13        9     7
21 to 40            11       10     8             61       57    39
41 to 60             1        1     1             30       25    17
61 to 100            0        –     –             18       14    11
> 100                0        –     –              5        4     3
All                 138      102   84            162      120    88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no matter how many passages are used.


Extracted          Google – 64 passages
Answers (#)     Questions   CAE    AS
0                   17        0     0
1 to 10             18        9     9
11 to 20             2        2     2
21 to 40            16       12     8
41 to 60            42       38    25
61 to 100           38       32    18
> 100               35       30    17
All                 168      123   79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.

            Google – 8 passages    Google – 32 passages    Google – 64 passages
      Total   Q    CAE   AS          Q    CAE   AS            Q    CAE   AS
abb     4     3     1     1          3     1     1            4     1     1
des     9     6     4     4          6     4     4            6     4     4
ent    21    15     1     1         16     1     0           17     1     0
hum    64    46    42    34         55    48    36           56    48    31
loc    39    27    24    19         34    27    22           35    29    21
num    63    41    30    25         48    35    25           50    40    22
      200   138   102    84        162   120    88          168   123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

             1134 questions   232 questions   1210 questions
Precision    26.7%            28.2%           27.4%
Recall       77.2%            67.2%           76.9%
Accuracy     20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.

Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages     20      30      50      100
Accuracy     21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still presents in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls | Okavango River | Botswana" webpage...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters, when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, specially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link (here, 'Kennedy') out of the three words of each answer. So we have LogSim approximately equal to 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should be the threshold here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.
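A sketch of the token-level heuristic proposed above, assuming simple whitespace tokenization and treating single-letter tokens (with or without a period) as initials:

def same_entity(answer1, answer2):
    tokens1 = answer1.replace('.', ' ').lower().split()
    tokens2 = answer2.replace('.', ' ').lower().split()
    if len(tokens1) != len(tokens2):
        return False
    full_match = False
    for w1, w2 in zip(tokens1, tokens2):
        if w1 == w2:
            full_match = True       # at least one full token in common
        elif w1[0] != w2[0]:
            return False            # initials must agree, in the same order
        elif len(w1) > 1 and len(w2) > 1:
            return False            # two different full words: reject
    return full_match

# same_entity('J. F. Kennedy', 'John Fitzgerald Kennedy')  returns True
# same_entity('John Kennedy', 'Jane Kennedy')              returns False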


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Oberbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases, such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above. The question remains: when and where do we apply it exactly?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to include these acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, valuing relationships of equivalence over inclusion, alternative and specially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the metrics already existent to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have this process automated and we have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share a same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to – and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and give a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.
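A minimal sketch of the boosting step follows, under the weights of the first experiment (1.0 for frequency and equivalence, 0.5 for inclusion); it reuses the RelationType and AnswerRelation types sketched above, and the class name RelationScorer is ours, not JustAsk's.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RelationScorer {
    static final double FREQUENCY_WEIGHT = 1.0;
    static final double EQUIVALENCE_WEIGHT = 1.0;
    static final double INCLUSION_WEIGHT = 0.5;

    Map<String, Double> score(List<String> candidates, List<AnswerRelation> relations) {
        Map<String, Double> scores = new HashMap<>();
        // frequency: every exact repetition of an answer adds 1.0
        for (String candidate : candidates) {
            scores.merge(candidate, FREQUENCY_WEIGHT, Double::sum);
        }
        // boost answers that share equivalence or inclusion relations
        for (AnswerRelation r : relations) {
            double w;
            if (r.type == RelationType.EQUIVALENCE) w = EQUIVALENCE_WEIGHT;
            else if (r.type == RelationType.INCLUSION) w = INCLUSION_WEIGHT;
            else continue; // alternative/contradictory relations are not boosted here
            scores.merge(r.source, w, Double::sum);
            scores.merge(r.target, w, Double::sum);
        }
        return scores;
    }
}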


Figure 5: Relations' module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.

4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
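A possible sketch of such a conversion normalizer is given below, assuming the numeric value and its unit have already been separated; the class name and the set of unit spellings handled are assumptions of ours, not the actual JustAsk code.

class UnitNormalizer {
    // Converts a metric value to the corresponding imperial/Fahrenheit value,
    // so that equivalent answers in different units can fall in the same cluster.
    static double normalize(double value, String unit) {
        switch (unit.toLowerCase()) {
            case "km":
            case "kilometres":
            case "kilometers": return value * 0.621371;        // kilometres to miles
            case "m":
            case "metres":
            case "meters":     return value * 3.28084;         // meters to feet
            case "kg":
            case "kilograms":  return value * 2.20462;         // kilograms to pounds
            case "celsius":    return value * 9.0 / 5.0 + 32;  // Celsius to Fahrenheit
            default:           return value;                   // unknown unit: leave untouched
        }
    }
}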

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' – in this case, however, we can easily limit King removals to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also evaluates very particular cases for the moment, leaving room for further extensions.
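The two normalizers just described could look roughly like the following sketch; the maps only cover the examples mentioned in the text, and the names are illustrative rather than JustAsk's actual classes.

import java.util.Map;

class AbbreviationNormalizer {
    private static final String[] TITLES = {"Mr", "Mrs", "Ms", "King", "Queen", "Dame", "Sir", "Dr"};

    private static final Map<String, String> ACRONYMS = Map.of(
            "US", "United States",
            "UN", "United Nations");

    static String normalize(String answer, String category) {
        String result = answer.trim();
        // strips a leading title token; as noted in the text, exceptions
        // (e.g. 'Mr T', or the band 'Queen') would still need special handling
        for (String title : TITLES) {
            if (result.startsWith(title + " ")) {
                result = result.substring(title.length() + 1);
                break;
            }
        }
        // expand acronyms only for categories where it is likely safe
        if (category.startsWith("Location") || category.startsWith("Human:Group")) {
            result = ACRONYMS.getOrDefault(result, result);
        }
        return result;
    }
}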

4.2.2 Proximity Measure

As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
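A sketch of this proximity boost is given below, under the stated threshold of 20 characters and a boost of 5 points; how positions are actually measured inside JustAsk's passages is an assumption here.

import java.util.List;

class ProximityScorer {
    private static final double DISTANCE_THRESHOLD = 20.0; // characters
    private static final double BOOST = 5.0;               // points added to the score

    // Returns the boost to apply to the answer's score for this passage.
    static double boost(String passage, String answer, List<String> keywords) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0 || keywords.isEmpty()) return 0.0;

        double totalDistance = 0.0;
        int found = 0;
        for (String keyword : keywords) {
            int keywordPos = passage.indexOf(keyword);
            if (keywordPos >= 0) {
                totalDistance += Math.abs(answerPos - keywordPos);
                found++;
            }
        }
        if (found == 0) return 0.0;
        double averageDistance = totalDistance / found;
        return (averageDistance < DISTANCE_THRESHOLD) ? BOOST : 0.0;
    }
}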


4.3 Results

                      200 questions                 1210 questions
Before features       P: 50.6  R: 87.0  A: 44.0     P: 28.0  R: 79.5  A: 22.2
After features        P: 51.1  R: 87.0  A: 44.5     P: 28.2  R: 79.8  A: 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered for the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear and, in those cases, a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it simply advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations' corpora in the command line.

In the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):

J(A, B) = |A ∩ B| / |A ∪ B|     (15)

In other words, the idea is to divide the number of characters the two sequences have in common by the total number of distinct characters across the two sequences.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)     (16)

Dice's Coefficient will basically give double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly four elements: {Ke, en, ne, dy}. The result would be (2 x 4)/(6 + 6) = 8/12 = 2/3. If we were using Jaccard, we would have 4/12 = 1/3. A small sketch of this bigram-based computation is given at the end of this list.


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) x (m/|s1| + m/|s2| + (m - t)/m)     (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l x p x (1 - dj)     (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta', than Jaro does; a small worked example is given at the end of this list.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 for Levenshtein):

D(i, j) = min{ D(i-1, j-1) + G,  D(i-1, j) + G,  D(i, j-1) + G }     (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
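As a quick check of formula (18) (a worked example of ours, not taken from the original experiments), consider the pair 'Martha' vs. 'Marhta': both strings have length 6, all six characters match (m = 6), one transposition is needed (t = 1) and the common prefix is 'Mar' (l = 3). Then dj = (1/3)(6/6 + 6/6 + (6 - 1)/6) ≈ 0.944 and dw = 0.944 + 3 x 0.1 x (1 - 0.944) ≈ 0.961, illustrating the prefix bonus.

The bigram-based computation used in the Dice and Jaccard examples above can also be sketched in a few lines of Java; this is an illustration of ours (the class and method names are not part of the SimMetrics library):

import java.util.HashSet;
import java.util.Set;

class BigramSimilarity {
    // set of character bigrams of a string, e.g. "Kennedy" -> {Ke, en, nn, ne, ed, dy}
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    static double dice(String a, String b) {
        Set<String> ga = bigrams(a), gb = bigrams(b);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    static double jaccard(String a, String b) {
        Set<String> ga = bigrams(a), gb = bigrams(b);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        Set<String> union = new HashSet<>(ga);
        union.addAll(gb);
        return (double) common.size() / union.size();
    }

    public static void main(String[] args) {
        // prints the bigram Dice coefficient and Jaccard index for the example pair
        System.out.println(dice("Kennedy", "Keneddy"));
        System.out.println(jaccard("Kennedy", "Keneddy"));
    }
}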


Then, the LogSim metric by Cordeiro and his team, described in the Related Work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
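The threshold sweep can be pictured as the following sketch, where StringMetric stands for any of the metrics above and evaluate() is a placeholder for running the clustering and scoring pipeline over the test corpus; both names are ours, not JustAsk's.

import java.util.List;

interface StringMetric {
    String name();
    // similarity normalized to [0, 1]
    double similarity(String a, String b);
}

class MetricSweep {
    // Placeholder: cluster the candidate answers with single-link clustering
    // at the given threshold and measure the resulting accuracy.
    static double evaluate(StringMetric metric, double threshold) {
        return 0.0;
    }

    static void run(List<StringMetric> metrics) {
        double[] thresholds = {0.17, 0.33, 0.5};
        for (StringMetric metric : metrics) {
            for (double t : thresholds) {
                System.out.printf("%s @ threshold %.2f -> accuracy %.3f%n",
                        metric.name(), t, evaluate(metric, t));
            }
        }
    }
}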

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric \ Threshold    0.17                       0.33                       0.5
Levenshtein           P: 51.1 R: 87.0 A: 44.5    P: 43.7 R: 87.5 A: 38.5    P: 37.3 R: 84.5 A: 31.5
Overlap               P: 37.9 R: 87.0 A: 33.0    P: 33.0 R: 87.0 A: 33.0    P: 36.4 R: 86.5 A: 31.5
Jaccard index         P: 9.8  R: 56.0 A: 5.5     P: 9.9  R: 55.5 A: 5.5     P: 9.9  R: 55.5 A: 5.5
Dice Coefficient      P: 9.8  R: 56.0 A: 5.5     P: 9.9  R: 56.0 A: 5.5     P: 9.9  R: 55.5 A: 5.5
Jaro                  P: 11.0 R: 63.5 A: 7.0     P: 11.9 R: 63.0 A: 7.5     P: 9.8  R: 56.0 A: 5.5
Jaro-Winkler          P: 11.0 R: 63.5 A: 7.0     P: 11.9 R: 63.0 A: 7.5     P: 9.8  R: 56.0 A: 5.5
Needleman-Wunsch      P: 43.7 R: 87.0 A: 38.0    P: 43.7 R: 87.0 A: 38.0    P: 16.9 R: 59.0 A: 10.0
LogSim                P: 43.7 R: 87.0 A: 38.0    P: 43.7 R: 87.0 A: 38.0    P: 42.0 R: 87.0 A: 36.5
Soundex               P: 35.9 R: 83.5 A: 30.0    P: 35.9 R: 83.5 A: 30.0    P: 28.7 R: 82.0 A: 23.5
Smith-Waterman        P: 17.3 R: 63.5 A: 11.0    P: 13.4 R: 59.5 A: 8.0     P: 10.4 R: 57.5 A: 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and adds an extra weight for strings that not only share at least a perfect match but also have another token sharing the first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens for that extra weight.

The new distance is computed the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore     (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))     (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2     (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
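A minimal sketch of this computation, following formulas (20)-(22), is given below; the whitespace tokenization and the handling of repeated tokens are assumptions of ours rather than the exact JustAsk implementation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class JFKDistance {
    static double distance(String a1, String a2) {
        List<String> t1 = new ArrayList<>(Arrays.asList(a1.toLowerCase().split("\\s+")));
        List<String> t2 = new ArrayList<>(Arrays.asList(a2.toLowerCase().split("\\s+")));
        int maxLength = Math.max(t1.size(), t2.size());

        // count and remove perfect token matches (the common terms)
        int commonTerms = 0;
        for (String token : new ArrayList<>(t1)) {
            if (t2.remove(token)) {
                t1.remove(token);
                commonTerms++;
            }
        }

        // among the unmatched tokens, count those sharing a first letter with
        // some unmatched token of the other answer
        int firstLetterMatches = 0;
        for (String u1 : t1) {
            for (String u2 : t2) {
                if (!u1.isEmpty() && !u2.isEmpty() && u1.charAt(0) == u2.charAt(0)) {
                    firstLetterMatches++;
                    break;
                }
            }
        }
        int nonCommonTerms = t1.size() + t2.size();

        double commonTermsScore = (double) commonTerms / maxLength;
        double firstLetterMatchesScore = nonCommonTerms == 0
                ? 0.0
                : ((double) firstLetterMatches / nonCommonTerms) / 2.0;

        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    public static void main(String[] args) {
        // prints the distance for the example pair discussed in the text
        System.out.println(distance("J F Kennedy", "John Fitzgerald Kennedy"));
    }
}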

A few flaws of this measure were identified later on, namely the scope of this measure within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows us the results after the implementation of this new distance metric.

                      200 questions                 1210 questions
Before                P: 51.1  R: 87.0  A: 44.5     P: 28.2  R: 79.8  A: 22.5
After new metric      P: 51.7  R: 87.0  A: 45.0     P: 29.2  R: 79.8  A: 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a combination of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)
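A minimal Java rendering of this decision, reusing the illustrative StringMetric interface sketched earlier in Section 5.1 (the category names and the fallback choice are assumptions of ours):

class CategoryMetricSelector {
    static StringMetric choose(String category, StringMetric levenshtein,
                               StringMetric logSim, StringMetric fallback) {
        if (category.startsWith("Human")) return levenshtein;
        if (category.startsWith("Location")) return logSim;
        // ... remaining categories ...
        return fallback;
    }
}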

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are looking for), the results for every metric incorporated, over the three main categories Human, Location and Numeric that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), are shown in Table 10.

Metric \ Category     Human     Location    Numeric
Levenshtein           50.00     53.85       4.84
Overlap               42.42     41.03       4.84
Jaccard index          7.58      0.00       0.00
Dice                   7.58      0.00       0.00
Jaro                   9.09      0.00       0.00
Jaro-Winkler           9.09      0.00       0.00
Needleman-Wunsch      42.42     46.15       4.84
LogSim                42.42     46.15       4.84
Soundex               42.42     46.15       3.23
Smith-Waterman         9.09      2.56       0.00
JFKDistance           54.55     53.85       4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers easily fail due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to: a) ignore 'one' to 'three' when written in words; b) somehow exclude the publishing date from the snippet; and c) improve the regular expressions used to filter candidates.

– Filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.

After inserting missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art study on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: a formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: the latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence, pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is a hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
– Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
– Example: Y2001 includes D01 M01 Y2001
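As an illustration (not the actual implementation), the two Date rules above can be read as the following check, assuming each answer has already been normalized into a list of date tokens such as [D01, M01, Y2001]:

import java.util.List;

class DateInclusionRule {
    // Ai includes Aj if Aj contains Ai and Aj has one or two extra tokens,
    // i.e. Aj is a more specific date than Ai.
    static boolean includes(List<String> ai, List<String> aj) {
        int extraTokens = aj.size() - ai.size();
        boolean contained = String.join(" ", aj).contains(String.join(" ", ai));
        return contained && (extraTokens == 1 || extraTokens == 2);
    }
}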


1.2 Questions of category Numeric

Given

• number = (\d+([0-9.E])*)

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX x max

• numXmin = numX x min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
– Example: 200 people is equivalent to 200

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
– Example: 210 people is equivalent to 200 people

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
– Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
– Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
– Example: approximately 200 people includes 210

• length(Aj) = 1 and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
– Example: more than 200 includes 300

• length(Aj) = 1 and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
– Example: less than 300 includes 200

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
– Example: John F. Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
– Example: Mr. Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
– Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
– Example: John F. Kennedy is equivalent to F. Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
– Example: Mr. Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
– Example:

• levenshteinDistance(Ai, Aj) < 1
– Example:

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
– Example: Republic of China is equivalent to China

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
– Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "in\s+Aj,0")
– Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
– Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
– Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
– Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
– Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
– Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
– Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
– Example: animal includes monkey

Page 15: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

lsquoGeorge Bushrsquo and lsquoGeorge W Bushrsquo we may need semantic information sincewe may be dealing with two different individuals

22 Paraphrase Identification

According to WordReferencecom 1 a paraphrase can be defined as

ndash (noun) rewording for the purpose of clarification

ndash (verb) express the same message in different words

There are other multiple definitions (other paraphrases of the definition ofthe word lsquoparaphrasersquo) In the context of our work we define paraphrases asexpressing the same idea in two different text fragments in a symmetric relation-ship in which one entails the other ndash meaning potential addition of informationwith only one fragment entailing the other should be discarded as such

Here is an example of two sentences which should be deemed as paraphrasestaken from the Microsoft Research Paraphrase Corpus 2 from now on mentionedas the MSR Corpus

Sentence 1 Amrozi accused his brother whom he called lsquothe wit-nessrsquo of deliberately distorting his evidenceSentence 2 Referring to him as only lsquothe witnessrsquo Amrozi accusedhis brother of deliberately distorting his evidence

In the following pages we will present an overview of some of approachesand techniques to paraphrase identification related to natural language speciallyafter the release of the MSR Corpus

221 SimFinder ndash A methodology based on linguistic analysis ofcomparable corpora

We can view two comparable texts as collections of several potential para-phrases SimFinder [9] was a tool made for summarization from a team led byVasileios Hatzivassiloglou that allowed to cluster highly related textual unitsIn 1999 they defined a more restricted concept of similarity in order to re-duce ambiguities two text units are similar if lsquothey share the same focus ona common concept actor object or actionrsquo In addition to that the commonobjectactor must either be the subject of the same description or must perform(or be subjected to) the same action The linguistic features implemented canbe grouped into two types

1httpwwwwordreferencecomdefinitionparaphrase2The Microsoft Research Paraphrase Corpus (MSR Paraphrase Corpus) is a file composed

of 5800 pairs of sentences which have been extracted from news sources over the web withmanual (human) annotations indicating whether a pair is a paraphrase or not It is used in aconsiderable amount of work in Paraphrase Identification

15

ndash Primitive Features ndash encompassing several ways to define a match suchas word co-occurrence (sharing a single word in common) matching nounphrases shared proper nouns WordNet [28] synonyms and matchingverbs with the same semantic class These matches not only consideridentical words but also words that match on their stem ndash the most basicform of a word from which affixes are attached

ndash Composite Features ndash combination of pairs of primitive features to im-prove clustering Analyzing if two joint primitives occur in the sameorder within a certain distance or matching joint primitives based onthe primitivesrsquo type ndash for instance matching a noun phrase and a verbtogether

The system showed better results than TD-IDF [37] the classic informationretrieval method already in its first version Some improvements were madelater on with a larger team that included Regina Barzilay a very prominentresearcher in the field of Paraphrase Identification namely the introduction ofa quantitative evaluation measure

222 Methodologies purely based on lexical similarity

In 2003 Barzilay and Lee [3] used the simple N-gram overlap measure toproduce clusters of paraphrases The N-gram overlap is used to count howmany N-grams overlap in two sentences (usually with N le 4) In other wordshow many different sequences of words we can find between two sentences Thefollowing formula gives us the measure of the Word Simple N-gram Overlap

NGsim(S1 S2) =1

Nlowast

Nsumn=1

Count match(N minus gram)

Count(N minus gram)(1)

in which Count match(N-gram) counts how many N-grams overlap for agiven sentence pair and Count(N-gram) represents the maximum number ofN-grams in the shorter sentence

A year later Dolan and his team [7] used Levenshtein Distance [19] to extractsentence-level paraphrases from parallel news sources and compared it withan heuristic strategy that paired initial sentences from different news storiesLevenshtein metric calculates the minimum number of editions to transformone string into another Results showed that Levenshtein data was cleanerand more easily aligned but the Alignment Error Rate (AER) an objectivescoring function to measure how well an algorithm can align words in the corpuscombining precision and recall metrics described in [31] showed that the twostrategies performed similarly each one with strengths and weaknesses A stringedit distance function like Levenshtein achieved high precision but involvedinadequate assumptions about the data while the heuristic allowed to extractmore rich but also less parallel data

16

In 2007 Cordeiro and his team [5] analyzed previous metrics namely Leven-shtein Distance and Word N-Gram overlap family of similarity measures [3 32]and concluded that these metrics were not enough to identify paraphrases lead-ing to wrong judgements and low accuracy scores The Levenshtein distance andan overlap distance measuring how many words overlap (see Section 313) areironically the two metrics that the actual answer extraction module at JustAskpossesses at the moment so this is a good starting point to explore some otheroptions As an alternative to these metrics a new category called LogSim wasproposed by Cordeirorsquos team The idea is for a pair of two input sentences tocount exclusive lexical links between them For instance in these two sentences

S1 Mary had a very nice dinner for her birthdayS2 It was a successful dinner party for Mary on her birthday

there are four links (lsquodinnerrsquo lsquoMaryrsquo lsquoherrsquo lsquobirthdayrsquo) ie four commonwords between S1 and S2 The function in use is

L(x λ) = minus log 2(λx) (2)

where x is the number of words of the longest sentence and λ is the number oflinks between S1 and S2 (with the logarithm acting as a slow penalty for toosimilar sentences that are way too similar) The complete formula is writtenbelow

logSim(S1 S2) =

L(x λ) ifL(x λ) lt 10eminusklowastL(xλ) otherwise

(3)

This ensures that the similarity only variates between 0 and 1 The ideabehind the second branch is to penalize pairs with great dissimilarities boostedby a constant k (used as 27 in their experiments) as λ

x tends to 0 and thereforeL(x λ) tends to infinity However having a metric depending only on x and noty ie the number of words in the shortest sentence is not a perfect measurendash after all they are not completely independent since λ le y Therefore anexpanded version of LogSim was also developed and tested by the same teamnow accepting three parameters λ x and y

L(x y z) is computed using weighting factors α and β (β being (1-α)) ap-plied to L(x λ) and L(y λ) respectively They used three self-constructedcorpus using unions of the MSR Corpus the Knight and Marcu Corpus 3 andextra negative pairs (of non-paraphrases) to compensate for the lack of nega-tive pairs offered by the two corpus When compared to the Levenshtein andN-gram metrics LogSim performed substantially better when the MSR Para-phrase Corpus was present in the corpus test

3This was a corpus used by Kevin Knight and Daniel Marcu produced manually from pairsof texts and respective summaries containing 1087 similar sentence pairs with one sentencebeing a compressed version of another

17

223 Canonicalization as complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution converting thepairs of sentences into canonical forms at the lexical and syntactic level ndashlsquogenericrsquo versions of the original text ndash and then applying lexical matching fea-tures such as Edit Distance and N-gram overlap This canonicalization involvesparticularly number entities (dates times monetary values percentages) pas-sive to active voice transformation (through dependency trees using Minipar[21]) and future tense sentences (mapping structures such as lsquoplans torsquo or lsquoex-pects to dorsquo into the word lsquowillrsquo) They also claimed that the MSR ParaphraseCorpus is limited by counting the number of the nominalizations (ie pro-ducing a noun from another part of the speech like lsquocombinationrsquo from lsquocom-binersquo) through a semi-automatic method and concluding that the MSR hasmany nominalizations and not enough variety ndash which discards a certain groupof paraphrases ndash the ones that use (many) different words to express the samemeaning This limitation can also explain why this method performs as wellas using a baseline approach Moreover the baseline approach with simplelexical matching features led to better results than other more lsquosophisticatedrsquomethodologies which again should support the evidence of the lack of lexical va-riety in the corpus used JustAsk already has a normalizer for answers of typeNumericCount and NumericDate so the kind of equivalences involvingNumeric Entities that can be considered answers for certain types of questionsare already covered Passive to Active Voice and Future Tense transformationscan however matter to more particular answers Considering for instance thequestion lsquoHow did President John F Kennedy diersquo the answers lsquoHe was shotrsquoand lsquoSomeone shot himrsquo are paraphrases in the passive and active voice respec-tively

224 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual copora in thefield of text-to-text generation Barzilay and Elhadad [2] raised a very pertinentquestion Two sentences sharing most of their words are likely to be paraphrasesBut these are the so-called lsquoeasyrsquo paraphrases There are many paraphrasesthat convey the same information but that share little lexical resemblance Andfor that the usual lexical similarity measures mentioned above should not beenough by themselves They proposed a two-phase process first using learningrules to match pairs of paragraphs based on similar topic structure (with twoparagraphs being grouped because they convey the same type of informationsuch as weather reports for instance) and then refining the match throughlocal alignement Not only the easy lsquoparaphrasesrsquo were identified but also someof the tricky ones by combining lexical similarity (a simple cosine measure)with dynamic programming Considering two sentences S1 and S2 their localsimilarity is given by

sim(S1 S2) = cos(S1 S2)minusmismatch penalty (4)

18

with the mismatch penalty being a decimal value used to penalize pairs withvery low similarity measure The weight of the optimal alignment between twopairs is given through the following formula of dynamic programming ndash a wayof breaking down a complex program into a series of simpler subproblems

s(S1 S2) = max

s(S1 S2minus 1)minus skip penaltys(S1minus 1 S2)minus skip penaltys(S1minus 1 S2minus 1) + sim(S1 S2)s(S1minus 1 S2minus 2) + sim(S1 S2) + sim(S1 S2minus 1)s(S1minus 2 S2minus 1) + sim(S1 S2) + sim(S1minus 1 S2)s(S1minus 2 S2minus 2) + sim(S1 S2minus 1) + sim(S1minus 1 S2)

(5)with the skip penalty preventing only sentence pairs with high similarity fromgetting aligned

Results were interesting A lsquoweakrsquo similarity measure combined with contex-tual information managed to outperform more sophisticated methods such asSimFinder using two collections from Encyclopedia Britannica and BritannicaElementary Context matters

This whole idea should be interesting to explore in a future line of investi-gation within JustAsk where we study the answerrsquos context ie the segmentswhere the answer has been extracted This work has also been included becauseit warns us for the several different types of paraphrasesequivalences that canexist in a corpus (more lexical-based or using completely different words)

225 Use of dissimilarities and semantics to provide better similar-ity measures

In 2006 Qiu and colleagues [33] have proposed a two-phased framework usingdissimilarity between sentences In the first phase similarities between tuplepairs (more than a single word) were computed using a similarity metric thattakes into account both syntactic structure and tuple content (derived from athesaurus) Then the remaining tuples that found no relevant match (unpairedtuples) which represent the extra bits of information than one sentence containsand the other does not are parsed through a dissimilarity classifier and thusto determine if we can deem the two sentences as paraphrases

Mihalcea and her team [27] decided that a semantical approach was essentialto identify paraphrases with better accuracy By measuring semantic similaritybetween two segments combining corpus-based and knowledge-based measuresthey were able to get much better results than a standard vector-based modelusing cosine similarity

They took into account not only the similarity but also the lsquospecificityrsquo be-tween two words giving more weight to a matching between two specific words(like two tonalities of the same color for instance) than two generic conceptsThis specificity is determined using inverse document frequency (IDF) [37] given

19

by the total number of documents in corpus divided by the total number of doc-uments including the word For the corpus they used British National Corpus4 a collection of 100 million words that gathers both spoken-language andwritten samples

The similarity of two text segments T1 and T2 (sim(T1T2)) is given by thefollowing formula

1

2(

sumwisinT1(maxsim(w T2)idf(w))sum

wisinT1 idf(w)+

sumwisinT2(maxsim(w T1)idf(w))sum

wisinT2 idf(w))

(6)For each word w in the segment T1 the goal is to identify the word that

matches the most in T2 given by maxsim (w T2) and then apply the specificityvia IDF make a sum of these two methods and normalize it with the length ofthe text segment in question The same process is applied to T2 and in the endwe generate an average between T1 to T2 and vice versa They offer us eightdifferent similarity measures to calculate maxsim two corpus-based (PointwiseMutual Information [42] and Latent Semantic Analysis ndash LSA [17]) and sixknowledge-based (Leacock and Chodorowndashlch [18] Lesk [1] Wu amp Palmerndashwup[46] Resnik [34] Lin [22] and Jiang amp Conrathndashjcn [14]) Below we offer abrief description of each of these metrics

ndash Pointwise Mutual Information (PMI-IR) ndash first suggested by Turney [42]as a measure of semantic similarity of words is based on the word co-occurrence using counts collected over very large corpora Given twowords their PMI-IR score indicates a degree of dependence between them

ndash Latent Semantic Analysis (LSA) ndash uses a technique in algebra called Sin-gular Value Decomposition (SVD) which in this case decomposes theterm-document matrix into three matrixes (reducing the linear systeminto multidimensional vectors of semantic space) A term-document ma-trix is composed of passages as rows and words as columns with the cellscontaining the number of times a certain word was used in a certain pas-sage

ndash Lesk ndash measures the number of overlaps (shared words) in the definitions(glosses) of two concepts and concepts directly related through relationsof hypernymy and meronymy

ndash Lch ndash computes the similarity between two nodes by finding the pathlength between two nodes in the is minus a hierarchy An is minus a hierarchydefines specializations (if we travel down the tree) and generalizationsbetween concepts (if we go up)

ndash Wup ndash computes the similarity between two nodes by calculating the pathlength from the least common subsumer (LCS) of the nodes ie the most

4httpwwwnatcorpoxacuk

20

specific node two nodes share in common as an ancestor If one node islsquocatrsquo and the other is lsquodogrsquo the LCS would be lsquomammalrsquo

ndash Resnik ndash takes the LCS of two concepts and computes an estimation of howlsquoinformativersquo the concept really is (its Information Content) A conceptthat occurs frequently will have low Information Content while a conceptthat is quite rare will have a high Information Content

ndash Lin ndash takes the Resnik measure and normalizes it by using also the Infor-mation Content of the two nodes received as input

ndash Jcn ndash uses the same information as Lin but combined in a different way

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e. methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model in order to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but also made between a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.

– Cosine measure – by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning, and therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors \vec{a} and \vec{b}, and the similarity is calculated using the following formula:

sim(\vec{a}, \vec{b}) = \frac{\vec{a} \; W \; \vec{b}}{|\vec{a}| \, |\vec{b}|}  (7)

with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each W_{ij}, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea, Qiu or Zhang, and five with also a better precision rating than those three metrics.
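A minimal sketch of Equation (7) follows, under the assumption that the semantic similarity matrix W has already been filled with one of the WordNet metrics; the vectors and the matrix below are toy values used only to show the computation.

// Sketch of Equation (7): sim(a, b) = (a W b) / (|a| |b|).
public class MatrixSimilarity {

    static double sim(double[] a, double[] b, double[][] w) {
        double num = 0.0;                       // numerator: a^T W b
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b.length; j++)
                num += a[i] * w[i][j] * b[j];
        return num / (norm(a) * norm(b));
    }

    static double norm(double[] v) {
        double s = 0.0;
        for (double x : v) s += x * x;
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[] a = {1, 1, 0};                 // words present in sentence S1
        double[] b = {1, 0, 1};                 // words present in sentence S2
        double[][] w = {{1.0, 0.2, 0.7},        // toy word-to-word similarity matrix
                        {0.2, 1.0, 0.1},
                        {0.7, 0.1, 1.0}};
        System.out.println(sim(a, b, w));
    }
}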

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S_1, S_2) = Sim(S_1, S_2) / Diss(S_1, S_2)  (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of a dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e. it adds much more information), then there is no two-way entailment and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies than the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas which have been described, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work here represents what we found with the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). From these, there were several interesting features attached, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e. pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID (http://www.inesc-id.pt) that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of a question's headword, i.e. a word in the given question that represents the information we are looking for. Question Classification is also a subtask in this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state of the art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e. extracts from a search corpus that contain key terms of the question, and are therefore most likely to contain a valid answer. The module is composed by a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies according to the question category. The most simple is the keyword-based strategy, applied for factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get us closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalization, clustering and filtering of the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will be transformed into 'D21 M3 Y1961'.
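As an illustration of this normalization step, here is a minimal sketch of a Numeric:Date normalizer producing the canonical form above; the two regular expressions are assumptions that cover only the surface forms of the example, not the patterns actually used in JustAsk.

import java.util.regex.*;

// Sketch of a date normalizer producing 'D<day> M<month> Y<year>'.
public class DateNormalizer {

    private static final String[] MONTHS = {"january", "february", "march", "april",
            "may", "june", "july", "august", "september", "october", "november", "december"};

    static int monthNumber(String name) {
        for (int i = 0; i < MONTHS.length; i++)
            if (MONTHS[i].equalsIgnoreCase(name)) return i + 1;
        return -1;
    }

    static String normalize(String answer) {
        Matcher m1 = Pattern.compile("(\\w+) (\\d{1,2}),? (\\d{4})").matcher(answer); // March 21, 1961
        if (m1.matches() && monthNumber(m1.group(1)) > 0)
            return "D" + m1.group(2) + " M" + monthNumber(m1.group(1)) + " Y" + m1.group(3);
        Matcher m2 = Pattern.compile("(\\d{1,2}) (\\w+) (\\d{4})").matcher(answer);   // 21 March 1961
        if (m2.matches() && monthNumber(m2.group(2)) > 0)
            return "D" + m2.group(1) + " M" + monthNumber(m2.group(2)) + " Y" + m2.group(3);
        return answer; // leave non-date answers untouched
    }

    public static void main(String[] args) {
        System.out.println(normalize("March 21, 1961"));  // D21 M3 Y1961
        System.out.println(normalize("21 March 1961"));   // D21 M3 Y1961
    }
}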

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 - \frac{|X \cap Y|}{\min(|X|, |Y|)}  (9)

where |X \cap Y| is the number of tokens shared by candidate answers X and Y, and \min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in the other. As mentioned before, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any member of the clusters. Initially, every candidate answer has a cluster of its own and then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
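The following is a minimal sketch of this single-link clustering step; the Levenshtein distance is normalized here by the length of the longer string so that it can be compared against a threshold such as 0.33, which is an assumption about how JustAsk normalizes it.

import java.util.*;

// Sketch of single-link clustering of candidate answers under a distance threshold.
public class SingleLinkClustering {

    static double levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return (double) d[a.length()][b.length()] / Math.max(a.length(), b.length());
    }

    // Single-link: distance between clusters is the minimum pairwise distance.
    static double clusterDistance(List<String> c1, List<String> c2) {
        double min = Double.MAX_VALUE;
        for (String x : c1) for (String y : c2) min = Math.min(min, levenshtein(x, y));
        return min;
    }

    static List<List<String>> cluster(List<String> answers, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String a : answers) clusters.add(new ArrayList<>(Collections.singletonList(a)));
        while (clusters.size() > 1) {
            int bi = -1, bj = -1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = clusterDistance(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            if (best > threshold) break;                 // no pair close enough: stop
            clusters.get(bi).addAll(clusters.remove(bj)); // merge the two closest clusters
        }
        return clusters;
    }

    public static void main(String[] args) {
        System.out.println(cluster(Arrays.asList("Mogadishu", "Mogadiscio", "Somalia"), 0.33));
    }
}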

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved  (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus  (11)

F-measure (F1):

F = \frac{2 \cdot precision \cdot recall}{precision + recall}  (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus  (13)

Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}  (14)
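A minimal sketch of how these measures can be computed from the judged output follows; the numbers in the usage example are the baseline values reported later (88 correct answers, 175 attempted, 200 questions), and a rank of 0 is used here to mean that no correct answer was returned.

// Sketch of the evaluation measures (10)-(14).
public class Evaluation {

    static double mrr(int[] ranks) {
        double sum = 0.0;
        for (int r : ranks) if (r > 0) sum += 1.0 / r;  // reciprocal rank of first correct answer
        return sum / ranks.length;
    }

    static double precision(int correct, int attempted) { return (double) correct / attempted; }
    static double recall(int attempted, int total)      { return (double) attempted / total; }
    static double accuracy(int correct, int total)      { return (double) correct / total; }
    static double f1(double p, double r)                { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        int[] ranks = {1, 0, 2, 1, 3};                  // toy ranks for 5 questions
        System.out.printf("MRR = %.3f%n", mrr(ranks));
        double p = precision(88, 175), r = recall(175, 200);
        System.out.printf("P = %.3f  R = %.3f  F1 = %.3f  A = %.3f%n",
                p, r, f1(p, r), accuracy(88, 200));
    }
}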

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then seeing if the answers the system outputs match the ones we found for ourselves.

We began to use the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references to the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of the references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages that we found at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64x200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:

1. Question=What is the capital of Somalia?
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing the corpus with a stipulated maximum of 64 snippets per question. The script runs the list of questions one by one and, for each one of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one finding and writing down correct answers, instead of the completely manual approach we used before writing the application. In this actual version, the script prints the status of the reference list to an output file as it is being filled, question by question, until the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software has also the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.


3.2.3 TREC corpus

In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to Google and Yahoo passages, we now have also 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions, from which we filtered to extract the questions we knew the New York Times had a correct answer for; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed by a set of questions and also by a set of correct/incorrect judgements from different sources according to each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly and what is the 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to filter only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us question: "How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?"

The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?

In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? the 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and therefore was composed by a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution should be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and by noticing a few mistakes in the application to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced it to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, in several variations, as in Google.

We also questioned whether the big problem was within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions: 200    Correct: 88    Wrong: 87    No Answer: 25

Accuracy: 44%    Precision: 50.3%    Recall: 87.5%    MRR: 0.37    Precision@1: 35.4%

Table 1: JustAsk baseline best results for the corpus test of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful were the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. They also present a distribution of the number of questions with a certain amount of extracted answers.

Extracted Answers (#) | Google – 8 passages: Questions / CAE success / AS success | Google – 32 passages: Questions / CAE success / AS success
0         | 21 / 0 / –     | 19 / 0 / –
1 to 10   | 73 / 62 / 53   | 16 / 11 / 11
11 to 20  | 32 / 29 / 22   | 13 / 9 / 7
21 to 40  | 11 / 10 / 8    | 61 / 57 / 39
41 to 60  | 1 / 1 / 1      | 30 / 25 / 17
61 to 100 | 0 / – / –      | 18 / 14 / 11
> 100     | 0 / – / –      | 5 / 4 / 3
All       | 138 / 102 / 84 | 162 / 120 / 88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.


Extracted Answers (#) | Google – 64 passages: Questions / CAE success / AS success
0         | 17 / 0 / 0
1 to 10   | 18 / 9 / 9
11 to 20  | 2 / 2 / 2
21 to 40  | 16 / 12 / 8
41 to 60  | 42 / 38 / 25
61 to 100 | 38 / 32 / 18
> 100     | 35 / 30 / 17
All       | 168 / 123 / 79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

Category | Total | Google – 8 passages: Q / CAE / AS | Google – 32 passages: Q / CAE / AS | Google – 64 passages: Q / CAE / AS
abb | 4   | 3 / 1 / 1      | 3 / 1 / 1      | 4 / 1 / 1
des | 9   | 6 / 4 / 4      | 6 / 4 / 4      | 6 / 4 / 4
ent | 21  | 15 / 1 / 1     | 16 / 1 / 0     | 17 / 1 / 0
hum | 64  | 46 / 42 / 34   | 55 / 48 / 36   | 56 / 48 / 31
loc | 39  | 27 / 24 / 19   | 34 / 27 / 22   | 35 / 29 / 21
num | 63  | 41 / 30 / 25   | 48 / 35 / 25   | 50 / 40 / 22
All | 200 | 138 / 102 / 84 | 162 / 120 / 88 | 168 / 123 / 79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS section in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put at a certain stage for a certain category – such as giving more focus to the Answer Selection with the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered correctly 234 questions (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

          | 1134 questions | 232 questions | 1210 questions
Precision | 26.7%          | 28.2%         | 27.4%
Recall    | 77.2%          | 67.2%         | 76.9%
Accuracy  | 20.6%          | 19.0%         | 21.1%

Table 5: Baseline results for the three sets of corpus that were created using TREC questions and the New York Times corpus.

Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

         | 20 passages | 30 passages | 50 passages | 100 passages
Accuracy | 21.1%       | 22.4%       | 23.7%       | 23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though the 100 passages offer the best results, it also takes more than thrice the time of running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section was designed to report some of the most common mistakes that JustAsk still presents over its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals their correctness:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'


– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls – Okavango River – Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters, when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity on a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e. common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case, we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e. one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issue – many entities share completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard type. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above. The question remains: when and where do we apply it exactly?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend abbreviation normalization to include this type of abbreviation.

Over this section we have looked at several points of improvement. Next, we will describe how we solved all of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers of a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We also may want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the metrics already existent to more efficient, semantically-based ones, using notions not only of synonymy, but also hyponymy and antonymy, that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have this process automated and we have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As it was described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]; moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers and, from then on, use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to – and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency – i.e. the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and give a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.
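A minimal sketch of this scoring scheme is given below; the names Scorer, BaselineScorer, RulesScorer and Answer are illustrative and do not correspond to the actual JustAsk classes, and the relation labels are simplified to plain strings.

import java.util.*;

// Sketch of relation-based score boosting: 1.0 for frequency/equivalence, 0.5 for inclusion.
interface Scorer {
    double score(Answer answer, List<Answer> allCandidates);
}

class Answer {
    final String text;
    final Map<Answer, String> relations = new HashMap<>(); // related answer -> relation type
    Answer(String text) { this.text = text; }
}

class BaselineScorer implements Scorer {
    // Counts how many other candidates are exactly the same answer (frequency).
    public double score(Answer answer, List<Answer> all) {
        double freq = 0.0;
        for (Answer other : all)
            if (other != answer && other.text.equalsIgnoreCase(answer.text)) freq += 1.0;
        return freq;
    }
}

class RulesScorer implements Scorer {
    // Adds a fixed boost per relation the answer shares with other candidates.
    public double score(Answer answer, List<Answer> all) {
        double s = 0.0;
        for (String type : answer.relations.values()) {
            if (type.equals("EQUIVALENCE")) s += 1.0;
            else if (type.startsWith("INCLUSION")) s += 0.5;
        }
        return s;
    }
}

class ScorerDemo {
    public static void main(String[] args) {
        Answer a = new Answer("Lee Myung-bak");
        Answer b = new Answer("Lee");
        a.relations.put(b, "INCLUSION_INCLUDES");
        b.relations.put(a, "INCLUSION_INCLUDED");
        List<Answer> all = Arrays.asList(a, b);
        System.out.println(new BaselineScorer().score(a, all) + new RulesScorer().score(a, all));
    }
}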


Figure 5: Relations module basic architecture. Each answer has now a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
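Below is a minimal sketch of such a conversion normalizer; the conversion factors are standard, but the method names and the string-based handling of the answers are assumptions rather than the actual implementation.

// Sketch of a measure-conversion normalizer to a canonical unit per answer type.
public class UnitNormalizer {

    static double kmToMiles(double km)          { return km * 0.621371; }
    static double metersToFeet(double m)        { return m * 3.28084; }
    static double kgToPounds(double kg)         { return kg * 2.20462; }
    static double celsiusToFahrenheit(double c) { return c * 9.0 / 5.0 + 32.0; }

    // e.g. "300 km" -> "186.41 miles"; other strings are left untouched.
    static String normalize(String answer) {
        String[] parts = answer.trim().split("\\s+");
        if (parts.length == 2) {
            try {
                double value = Double.parseDouble(parts[0]);
                switch (parts[1].toLowerCase()) {
                    case "km": case "kilometres": case "kilometers":
                        return String.format("%.2f miles", kmToMiles(value));
                    case "kg": case "kilograms":
                        return String.format("%.2f pounds", kgToPounds(value));
                    case "meters": case "metres":
                        return String.format("%.2f feet", metersToFeet(value));
                }
            } catch (NumberFormatException ignored) { }
        }
        return answer;
    }

    public static void main(String[] args) {
        System.out.println(normalize("300 km"));     // 186.41 miles
        System.out.println(normalize("1000 miles")); // unchanged
    }
}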

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worthy of noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals only to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the case of the normalizers above, it did not produce any noticeable changes in the system results, but it also evaluates very particular cases for the moment, allowing room for further extensions.

4.2.2 Proximity Measure

As for the problem in 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear through the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
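A minimal sketch of this proximity boost follows, using the 20-character threshold and the 5-point boost described above; the character-offset computation is an assumption about how the average distance is measured.

// Sketch of the proximity feature: boost answers that sit near the query keywords.
public class ProximityBoost {

    static final int DISTANCE_THRESHOLD = 20;
    static final double BOOST = 5.0;

    static double averageDistance(String passage, String answer, String[] keywords) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0) return Double.MAX_VALUE;
        double sum = 0.0;
        int found = 0;
        for (String k : keywords) {
            int kPos = passage.indexOf(k);
            if (kPos >= 0) {
                sum += Math.abs(kPos - answerPos);
                found++;
            }
        }
        return found == 0 ? Double.MAX_VALUE : sum / found;
    }

    static double boostedScore(double score, String passage, String answer, String[] keywords) {
        return averageDistance(passage, answer, keywords) <= DISTANCE_THRESHOLD
                ? score + BOOST : score;
    }

    public static void main(String[] args) {
        String passage = "The Okavango River rises in the Angolan highlands and flows over 1000 miles";
        String[] keywords = {"Okavango", "River", "long"};
        System.out.println(boostedScore(1.0, passage, "1000 miles", keywords));
    }
}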


4.3 Results

                | 200 questions             | 1210 questions
Before features | P 50.6%  R 87.0%  A 44.0% | P 28.0%  R 79.5%  A 22.2%
After features  | P 51.1%  R 87.0%  A 44.5% | P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the corpus test (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described over Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references we gathered in the previous corpus and, for each pair of references, assigned one of the possible four types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in that case a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As it happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;


– 'D' for cases of doubt between relation types.

If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.

Over the next chapter, we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of these) we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics to potentially improve our system, as we also proposed doing. Therefore, we decided to try new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined by the size of the intersection divided by the size of the union of two sets (A and B):

J(A, B) = \frac{|A \cap B|}{|A \cup B|}  (15)

In other words, the idea will be to divide the number of characters in common by the sum of the total number of characters of the two sequences.

– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using the double of the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = \frac{2|A \cap B|}{|A| + |B|}  (16)

Dice's coefficient will basically give double the weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams for each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly five elements: Ke, en, ne, ed, dy. The result would be (2*5)/(6+6) = 10/12, approximately 0.83. If we were using Jaccard, we would have 5/7, approximately 0.71, since the union of the two sets has seven elements (a small code sketch of this bigram-based computation is given after this list).


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)  (17)

in which s_1 and s_2 are the two sequences to be tested, |s_1| and |s_2| are their lengths, m is the number of matching characters in s_1 and s_2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – is an extension of the Jaro metric. This distance is given by:

d_w = d_j + l \cdot p \cdot (1 - d_j)  (18)

with d_j being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 for Levenshtein):

D(i, j) = \min\{D(i-1, j-1) + G,\ D(i-1, j) + G,\ D(i, j-1) + G\}  (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the threshold proposed (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – is a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.
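As mentioned in the description of Dice's coefficient above, here is a minimal sketch of the bigram-based Dice and Jaccard computation for the 'Kennedy' vs. 'Keneddy' example; it assumes set semantics for the bigrams, which is how the worked example is computed.

import java.util.*;

// Sketch of character-bigram Dice and Jaccard similarity.
public class BigramSimilarity {

    static Set<String> bigrams(String s) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i < s.length() - 1; i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    static double dice(String a, String b) {
        Set<String> ga = bigrams(a), gb = bigrams(b);
        Set<String> inter = new HashSet<>(ga);
        inter.retainAll(gb);
        return 2.0 * inter.size() / (ga.size() + gb.size());
    }

    static double jaccard(String a, String b) {
        Set<String> ga = bigrams(a), gb = bigrams(b);
        Set<String> inter = new HashSet<>(ga);
        inter.retainAll(gb);
        Set<String> union = new HashSet<>(ga);
        union.addAll(gb);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));    // 10/12
        System.out.println(jaccard("Kennedy", "Keneddy")); // 5/7
    }
}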


Then, the LogSim metric by Cordeiro and his team, described in the related work, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much it affects the performance of each metric.

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides us the best results of the system at this point, and we should expect proportional results between these two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases when it is only the second best, even if the difference itself is virtually inexistent in most cases.


Metric \ Threshold | 0.17 | 0.33 | 0.5
Levenshtein      | P 51.1  R 87.0  A 44.5 | P 43.7  R 87.5  A 38.5 | P 37.3  R 84.5  A 31.5
Overlap          | P 37.9  R 87.0  A 33.0 | P 33.0  R 87.0  A 33.0 | P 36.4  R 86.5  A 31.5
Jaccard index    | P 9.8   R 56.0  A 5.5  | P 9.9   R 55.5  A 5.5  | P 9.9   R 55.5  A 5.5
Dice coefficient | P 9.8   R 56.0  A 5.5  | P 9.9   R 56.0  A 5.5  | P 9.9   R 55.5  A 5.5
Jaro             | P 11.0  R 63.5  A 7.0  | P 11.9  R 63.0  A 7.5  | P 9.8   R 56.0  A 5.5
Jaro-Winkler     | P 11.0  R 63.5  A 7.0  | P 11.9  R 63.0  A 7.5  | P 9.8   R 56.0  A 5.5
Needleman-Wunsch | P 43.7  R 87.0  A 38.0 | P 43.7  R 87.0  A 38.0 | P 16.9  R 59.0  A 10.0
LogSim           | P 43.7  R 87.0  A 38.0 | P 43.7  R 87.0  A 38.0 | P 42.0  R 87.0  A 36.5
Soundex          | P 35.9  R 83.5  A 30.0 | P 35.9  R 83.5  A 30.0 | P 28.7  R 82.0  A 23.5
Smith-Waterman   | P 17.3  R 63.5  A 11.0 | P 13.4  R 59.5  A 8.0  | P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (in %) for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim, also using a tokenizer to break the strings into a set of tokens, and then taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and also adding an extra weight to strings that not only share at least a perfect match, but also have another token that shares the first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would be a value identical to a match between two tokens for that extra weight.

The new distance is computed the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore  (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))}  (21)

and

firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2}  (22)

with a_1 and a_2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
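Below is a minimal sketch of the new metric following Equations (20)-(22); how nonCommonTerms is counted (here, all unmatched tokens of both answers) is an assumption, and the actual JustAsk implementation may differ in details such as tokenization.

import java.util.*;

// Sketch of the proposed 'JFKDistance': exact token matches count fully,
// first-letter matches among the remaining tokens count half.
public class JFKDistance {

    static double distance(String answer1, String answer2) {
        List<String> t1 = new ArrayList<>(Arrays.asList(answer1.trim().toLowerCase().split("\\s+")));
        List<String> t2 = new ArrayList<>(Arrays.asList(answer2.trim().toLowerCase().split("\\s+")));
        int maxLength = Math.max(t1.size(), t2.size());

        // common terms (exact matches), removed from both token lists
        int commonTerms = 0;
        for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
            if (t2.remove(it.next())) { commonTerms++; it.remove(); }
        }

        // first-letter matches among the remaining (non-common) tokens
        int nonCommonTerms = t1.size() + t2.size(); // assumption: count both sides
        int firstLetterMatches = 0;
        for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
            char first = it.next().charAt(0);
            for (Iterator<String> jt = t2.iterator(); jt.hasNext(); ) {
                if (first == jt.next().charAt(0)) {
                    firstLetterMatches++;
                    jt.remove();
                    it.remove();
                    break;
                }
            }
        }

        double commonTermsScore = (double) commonTerms / maxLength;              // Equation (21)
        double firstLetterMatchesScore = nonCommonTerms == 0
                ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;    // Equation (22)
        return 1.0 - commonTermsScore - firstLetterMatchesScore;                 // Equation (20)
    }

    public static void main(String[] args) {
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
        System.out.println(distance("Mt Kilimanjaro", "Mount Kilimanjaro"));
    }
}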

A few flaws of this measure were identified later on, namely the scope of this measure within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows us the results after the implementation of this new distance metric.

                 | 200 questions             | 1210 questions
Before           | P 51.1%  R 87.0%  A 44.5% | P 28.2%  R 79.8%  A 22.5%
After new metric | P 51.7%  R 87.0%  A 45.0% | P 29.2%  R 79.8%  A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or a group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers us the best results, we would have something like:

if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect much the analysis we are looking for), here are the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, which retrieve the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric / Category     Human    Location   Numeric
Levenshtein           50.00    53.85      4.84
Overlap               42.42    41.03      4.84
Jaccard index          7.58     0.00      0.00
Dice                   7.58     0.00      0.00
Jaro                   9.09     0.00      0.00
Jaro-Winkler           9.09     0.00      0.00
Needleman-Wunsch      42.42    46.15      4.84
LogSim                42.42    46.15      4.84
Soundex               42.42    46.15      3.23
Smith-Waterman         9.09     2.56      0.00
JFKDistance           54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking the best metric for each category, since we find that the newly implemented distance equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – Some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).

– Numeric answers easily fail due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve a) ignoring 'one' to 'three' when written in words, b) excluding the publishing date from the snippet somehow, and c) improving the regular expressions used to filter candidates.

– Filtering stage does not seem to work in certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers (a sketch of such a filter is given at the end of this section).

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we list the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is in the reference file that holds all the possible 'correct' answers, which may be missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly: the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.
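As an illustration of the filtering rule suggested above for answers that merely echo the question, a possible sketch follows (illustrative Python, not part of JustAsk; the token-overlap test and the 0.8 threshold are assumptions):

import re

def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def is_question_echo(candidate: str, question: str) -> bool:
    """True if (almost) every token of the candidate already occurs in the
    question, e.g. 'John Famalaro' for 'Who is John J. Famalaro accused of
    having killed?'."""
    cand, quest = _tokens(candidate), _tokens(question)
    if not cand:
        return True
    return len(cand & quest) / len(cand) >= 0.8    # threshold is an assumption

def filter_candidates(candidates, question):
    return [c for c in candidates if not is_question_echo(c, question)]

print(filter_candidates(["John Famalaro", "Denise Huber"],
                        "Who is John J. Famalaro accused of having killed?"))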


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state-of-the-art survey on relations between answers and on paraphrase identification, in order to detect new possible ways to improve JustAsk's results in the Answer Extraction module

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion

– We developed new metrics to improve the clustering of answers

– We developed new answer normalizers to relate two equal answers in different formats

– We developed new corpora that we believe will be extremely useful in the future

– We performed further analysis to detect persistent error patterns

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we should try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers

– Use the rules as a distance metric of their own

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help to somewhat improve these results


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870–4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)], 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2 and considering

• Ai[j] – the jth token of candidate answer Ai

• Ai[j][k] – the kth char of the jth token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is the hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
– Example: 'M01 Y2001' includes 'D01 M01 Y2001'

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
– Example: 'Y2001' includes 'D01 M01 Y2001'
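A minimal sketch of these two inclusion rules (illustrative Python; the token-level reading of contains() and the function name are assumptions):

def date_includes(ai: str, aj: str) -> bool:
    """Ai includes Aj if all (normalized) tokens of Ai occur in Aj and Aj is
    more specific by one or two tokens, e.g. 'M01 Y2001' includes
    'D01 M01 Y2001', and 'Y2001' includes 'D01 M01 Y2001'."""
    t_i, t_j = ai.split(), aj.split()
    return all(tok in t_j for tok in t_i) and 1 <= len(t_j) - len(t_i) <= 2

print(date_includes("M01 Y2001", "D01 M01 Y2001"))   # True
print(date_includes("Y2001", "D01 M01 Y2001"))       # True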


1.2 Questions of category Numeric

Given

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX * max

• numXmin = numX * min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
– Example: '200 people' is equivalent to '200'

• matchesic(Ai, "(.*)\bnumber( .*)") and matchesic(Aj, "(.*)\bnumber( .*)") and (numAimax > numAj) and (numAimin < numAj)
– Example: '210 people' is equivalent to '200 people'

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)")
– Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)") and (numAi = numAj)
– Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)") and (numAjmax > numAi) and (numAjmin < numAi)
– Example: 'approximately 200 people' includes '210'


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number( .*)") and (numAi < numAj)
– Example: 'more than 200' includes '300'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number( .*)") and (numAi > numAj)
– Example: 'less than 300' includes '200'
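A possible rendering of the approximate-equivalence rule above as code (illustrative Python, not the actual rule engine; the simplified number regex and the helper names are assumptions, while the 1.1/0.9 factors mirror the max and min constants):

import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")
MAX_FACTOR, MIN_FACTOR = 1.1, 0.9          # the 'max' and 'min' constants above

def _num(answer: str):
    m = NUMBER.search(answer.replace(",", ""))
    return float(m.group()) if m else None

def numeric_equivalent(a1: str, a2: str) -> bool:
    """'210 people' is equivalent to '200 people' under a 10% tolerance."""
    n1, n2 = _num(a1), _num(a2)
    if n1 is None or n2 is None:
        return False
    return n1 * MIN_FACTOR < n2 < n1 * MAX_FACTOR

print(numeric_equivalent("210 people", "200 people"))   # True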

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai[0], Aj[0]) and equalsic(Ai[2], Aj[1])
– Example: 'John F Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai[1], Aj[1]) and matches(Ai[0], "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
– Example: 'Mr Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai[0], Aj[0]) and equalsic(Ai[2], Aj[2]) and equalsic(Ai[1][0], Aj[1][0])
– Example: 'John Fitzgerald Kennedy' is equivalent to 'John F Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai[1], Aj[0]) and equalsic(Ai[2], Aj[1])
– Example: 'John F Kennedy' is equivalent to 'F Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai[1], Aj[0])
– Example: 'Mr Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)

• levenshteinDistance(Ai, Aj) < 1

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai[2], Aj[0]) and matches(Ai[0], "(republic|city|kingdom|island)") and matches(Ai[1], "(of|in)")
– Example: 'Republic of China' is equivalent to 'China'


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai[2], Aj[0]) and matches(Ai[1], "(of|in)")

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai[1], Aj[0]) and matches(Ai, "in\s+Aj[0]")
– Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj[0]")
– Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj[0]")

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai[0], Aj[2]) and equalsic(Aj[1], ",")
– Example: 'Florida' includes 'Miami, Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai[0], Aj[3]) and equalsic(Aj[2], ",")
– Example: 'California' includes 'Los Angeles, California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai[0], Aj[1]) and matchesic(Aj[0], "(south|west|north|east|central)[a-z]*")
– Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
– Example: 'animal' includes 'monkey'
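To make the flavour of these rules concrete, here is a small illustrative implementation of two of the Location patterns above (Python; the helper names, the simplified regular expressions and the comma handling are assumptions for the sketch, not the actual JustAsk rule engine):

import re

PREFIX = r"(?:republic|city|kingdom|island)"
REGION = r"(?:south|west|north|east|central)"

def location_equivalent(a1: str, a2: str) -> bool:
    """'Republic of China' or 'in China' is equivalent to 'China'."""
    longer, shorter = (a1, a2) if len(a1) >= len(a2) else (a2, a1)
    target = re.escape(shorter.strip())
    patterns = [
        rf"{PREFIX}\s+(?:of|in)\s+{target}",   # 'Republic of China'
        rf"in\s+{target}",                     # 'in China'
    ]
    return any(re.fullmatch(p, longer.strip(), re.IGNORECASE) for p in patterns)

def location_includes(a1: str, a2: str) -> bool:
    """'Florida' includes 'Miami, Florida'; 'California' includes 'East California'."""
    target = re.escape(a1.strip())
    patterns = [
        rf".+,\s*{target}",          # 'Miami, Florida'
        rf"{REGION}\s+{target}",     # 'East California'
    ]
    return any(re.fullmatch(p, a2.strip(), re.IGNORECASE) for p in patterns)

print(location_equivalent("Republic of China", "China"))   # True
print(location_includes("Florida", "Miami, Florida"))      # True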

• Introduction
• Related Work
  • Relating Answers
  • Paraphrase Identification
    • SimFinder – A methodology based on linguistic analysis of comparable corpora
    • Methodologies purely based on lexical similarity
    • Canonicalization as complement to lexical features
    • Use of context to extract similar phrases
    • Use of dissimilarities and semantics to provide better similarity measures
  • Overview
• JustAsk architecture and work points
  • JustAsk
    • Question Interpretation
    • Passage Retrieval
    • Answer Extraction
  • Experimental Setup
    • Evaluation measures
    • Multisix corpus
    • TREC corpus
  • Baseline
    • Results with Multisix Corpus
    • Results with TREC Corpus
    • Error Analysis over the Answer Module
• Relating Answers
  • Improving the Answer Extraction module: Relations Module Implementation
  • Additional Features
    • Normalizers
    • Proximity Measure
  • Results
  • Results Analysis
  • Building an Evaluation Corpus
• Clustering Answers
  • Testing Different Similarity Metrics
    • Clusterer Metrics
    • Results
    • Results Analysis
  • New Distance Metric - JFKDistance
    • Description
    • Results
    • Results Analysis
  • Possible new features – Combination of distance metrics to enhance overall results
  • Posterior analysis
• Conclusions and Future Work
ndash Primitive Features ndash encompassing several ways to define a match suchas word co-occurrence (sharing a single word in common) matching nounphrases shared proper nouns WordNet [28] synonyms and matchingverbs with the same semantic class These matches not only consideridentical words but also words that match on their stem ndash the most basicform of a word from which affixes are attached

ndash Composite Features ndash combination of pairs of primitive features to im-prove clustering Analyzing if two joint primitives occur in the sameorder within a certain distance or matching joint primitives based onthe primitivesrsquo type ndash for instance matching a noun phrase and a verbtogether

The system showed better results than TD-IDF [37] the classic informationretrieval method already in its first version Some improvements were madelater on with a larger team that included Regina Barzilay a very prominentresearcher in the field of Paraphrase Identification namely the introduction ofa quantitative evaluation measure

222 Methodologies purely based on lexical similarity

In 2003 Barzilay and Lee [3] used the simple N-gram overlap measure toproduce clusters of paraphrases The N-gram overlap is used to count howmany N-grams overlap in two sentences (usually with N le 4) In other wordshow many different sequences of words we can find between two sentences Thefollowing formula gives us the measure of the Word Simple N-gram Overlap

NGsim(S1 S2) =1

Nlowast

Nsumn=1

Count match(N minus gram)

Count(N minus gram)(1)

in which Count match(N-gram) counts how many N-grams overlap for agiven sentence pair and Count(N-gram) represents the maximum number ofN-grams in the shorter sentence

A year later Dolan and his team [7] used Levenshtein Distance [19] to extractsentence-level paraphrases from parallel news sources and compared it withan heuristic strategy that paired initial sentences from different news storiesLevenshtein metric calculates the minimum number of editions to transformone string into another Results showed that Levenshtein data was cleanerand more easily aligned but the Alignment Error Rate (AER) an objectivescoring function to measure how well an algorithm can align words in the corpuscombining precision and recall metrics described in [31] showed that the twostrategies performed similarly each one with strengths and weaknesses A stringedit distance function like Levenshtein achieved high precision but involvedinadequate assumptions about the data while the heuristic allowed to extractmore rich but also less parallel data

16

In 2007 Cordeiro and his team [5] analyzed previous metrics namely Leven-shtein Distance and Word N-Gram overlap family of similarity measures [3 32]and concluded that these metrics were not enough to identify paraphrases lead-ing to wrong judgements and low accuracy scores The Levenshtein distance andan overlap distance measuring how many words overlap (see Section 313) areironically the two metrics that the actual answer extraction module at JustAskpossesses at the moment so this is a good starting point to explore some otheroptions As an alternative to these metrics a new category called LogSim wasproposed by Cordeirorsquos team The idea is for a pair of two input sentences tocount exclusive lexical links between them For instance in these two sentences

S1 Mary had a very nice dinner for her birthdayS2 It was a successful dinner party for Mary on her birthday

there are four links (lsquodinnerrsquo lsquoMaryrsquo lsquoherrsquo lsquobirthdayrsquo) ie four commonwords between S1 and S2 The function in use is

L(x λ) = minus log 2(λx) (2)

where x is the number of words of the longest sentence and λ is the number oflinks between S1 and S2 (with the logarithm acting as a slow penalty for toosimilar sentences that are way too similar) The complete formula is writtenbelow

logSim(S1 S2) =

L(x λ) ifL(x λ) lt 10eminusklowastL(xλ) otherwise

(3)

This ensures that the similarity only variates between 0 and 1 The ideabehind the second branch is to penalize pairs with great dissimilarities boostedby a constant k (used as 27 in their experiments) as λ

x tends to 0 and thereforeL(x λ) tends to infinity However having a metric depending only on x and noty ie the number of words in the shortest sentence is not a perfect measurendash after all they are not completely independent since λ le y Therefore anexpanded version of LogSim was also developed and tested by the same teamnow accepting three parameters λ x and y

L(x y z) is computed using weighting factors α and β (β being (1-α)) ap-plied to L(x λ) and L(y λ) respectively They used three self-constructedcorpus using unions of the MSR Corpus the Knight and Marcu Corpus 3 andextra negative pairs (of non-paraphrases) to compensate for the lack of nega-tive pairs offered by the two corpus When compared to the Levenshtein andN-gram metrics LogSim performed substantially better when the MSR Para-phrase Corpus was present in the corpus test

3This was a corpus used by Kevin Knight and Daniel Marcu produced manually from pairsof texts and respective summaries containing 1087 similar sentence pairs with one sentencebeing a compressed version of another

17

223 Canonicalization as complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution converting thepairs of sentences into canonical forms at the lexical and syntactic level ndashlsquogenericrsquo versions of the original text ndash and then applying lexical matching fea-tures such as Edit Distance and N-gram overlap This canonicalization involvesparticularly number entities (dates times monetary values percentages) pas-sive to active voice transformation (through dependency trees using Minipar[21]) and future tense sentences (mapping structures such as lsquoplans torsquo or lsquoex-pects to dorsquo into the word lsquowillrsquo) They also claimed that the MSR ParaphraseCorpus is limited by counting the number of the nominalizations (ie pro-ducing a noun from another part of the speech like lsquocombinationrsquo from lsquocom-binersquo) through a semi-automatic method and concluding that the MSR hasmany nominalizations and not enough variety ndash which discards a certain groupof paraphrases ndash the ones that use (many) different words to express the samemeaning This limitation can also explain why this method performs as wellas using a baseline approach Moreover the baseline approach with simplelexical matching features led to better results than other more lsquosophisticatedrsquomethodologies which again should support the evidence of the lack of lexical va-riety in the corpus used JustAsk already has a normalizer for answers of typeNumericCount and NumericDate so the kind of equivalences involvingNumeric Entities that can be considered answers for certain types of questionsare already covered Passive to Active Voice and Future Tense transformationscan however matter to more particular answers Considering for instance thequestion lsquoHow did President John F Kennedy diersquo the answers lsquoHe was shotrsquoand lsquoSomeone shot himrsquo are paraphrases in the passive and active voice respec-tively

224 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual copora in thefield of text-to-text generation Barzilay and Elhadad [2] raised a very pertinentquestion Two sentences sharing most of their words are likely to be paraphrasesBut these are the so-called lsquoeasyrsquo paraphrases There are many paraphrasesthat convey the same information but that share little lexical resemblance Andfor that the usual lexical similarity measures mentioned above should not beenough by themselves They proposed a two-phase process first using learningrules to match pairs of paragraphs based on similar topic structure (with twoparagraphs being grouped because they convey the same type of informationsuch as weather reports for instance) and then refining the match throughlocal alignement Not only the easy lsquoparaphrasesrsquo were identified but also someof the tricky ones by combining lexical similarity (a simple cosine measure)with dynamic programming Considering two sentences S1 and S2 their localsimilarity is given by

sim(S1 S2) = cos(S1 S2)minusmismatch penalty (4)

18

with the mismatch penalty being a decimal value used to penalize pairs withvery low similarity measure The weight of the optimal alignment between twopairs is given through the following formula of dynamic programming ndash a wayof breaking down a complex program into a series of simpler subproblems

s(S1 S2) = max

s(S1 S2minus 1)minus skip penaltys(S1minus 1 S2)minus skip penaltys(S1minus 1 S2minus 1) + sim(S1 S2)s(S1minus 1 S2minus 2) + sim(S1 S2) + sim(S1 S2minus 1)s(S1minus 2 S2minus 1) + sim(S1 S2) + sim(S1minus 1 S2)s(S1minus 2 S2minus 2) + sim(S1 S2minus 1) + sim(S1minus 1 S2)

(5)with the skip penalty preventing only sentence pairs with high similarity fromgetting aligned

Results were interesting A lsquoweakrsquo similarity measure combined with contex-tual information managed to outperform more sophisticated methods such asSimFinder using two collections from Encyclopedia Britannica and BritannicaElementary Context matters

This whole idea should be interesting to explore in a future line of investi-gation within JustAsk where we study the answerrsquos context ie the segmentswhere the answer has been extracted This work has also been included becauseit warns us for the several different types of paraphrasesequivalences that canexist in a corpus (more lexical-based or using completely different words)

225 Use of dissimilarities and semantics to provide better similar-ity measures

In 2006 Qiu and colleagues [33] have proposed a two-phased framework usingdissimilarity between sentences In the first phase similarities between tuplepairs (more than a single word) were computed using a similarity metric thattakes into account both syntactic structure and tuple content (derived from athesaurus) Then the remaining tuples that found no relevant match (unpairedtuples) which represent the extra bits of information than one sentence containsand the other does not are parsed through a dissimilarity classifier and thusto determine if we can deem the two sentences as paraphrases

Mihalcea and her team [27] decided that a semantical approach was essentialto identify paraphrases with better accuracy By measuring semantic similaritybetween two segments combining corpus-based and knowledge-based measuresthey were able to get much better results than a standard vector-based modelusing cosine similarity

They took into account not only the similarity but also the lsquospecificityrsquo be-tween two words giving more weight to a matching between two specific words(like two tonalities of the same color for instance) than two generic conceptsThis specificity is determined using inverse document frequency (IDF) [37] given

19

by the total number of documents in corpus divided by the total number of doc-uments including the word For the corpus they used British National Corpus4 a collection of 100 million words that gathers both spoken-language andwritten samples

The similarity of two text segments T1 and T2 (sim(T1T2)) is given by thefollowing formula

1

2(

sumwisinT1(maxsim(w T2)idf(w))sum

wisinT1 idf(w)+

sumwisinT2(maxsim(w T1)idf(w))sum

wisinT2 idf(w))

(6)For each word w in the segment T1 the goal is to identify the word that

matches the most in T2 given by maxsim (w T2) and then apply the specificityvia IDF make a sum of these two methods and normalize it with the length ofthe text segment in question The same process is applied to T2 and in the endwe generate an average between T1 to T2 and vice versa They offer us eightdifferent similarity measures to calculate maxsim two corpus-based (PointwiseMutual Information [42] and Latent Semantic Analysis ndash LSA [17]) and sixknowledge-based (Leacock and Chodorowndashlch [18] Lesk [1] Wu amp Palmerndashwup[46] Resnik [34] Lin [22] and Jiang amp Conrathndashjcn [14]) Below we offer abrief description of each of these metrics

ndash Pointwise Mutual Information (PMI-IR) ndash first suggested by Turney [42]as a measure of semantic similarity of words is based on the word co-occurrence using counts collected over very large corpora Given twowords their PMI-IR score indicates a degree of dependence between them

ndash Latent Semantic Analysis (LSA) ndash uses a technique in algebra called Sin-gular Value Decomposition (SVD) which in this case decomposes theterm-document matrix into three matrixes (reducing the linear systeminto multidimensional vectors of semantic space) A term-document ma-trix is composed of passages as rows and words as columns with the cellscontaining the number of times a certain word was used in a certain pas-sage

ndash Lesk ndash measures the number of overlaps (shared words) in the definitions(glosses) of two concepts and concepts directly related through relationsof hypernymy and meronymy

ndash Lch ndash computes the similarity between two nodes by finding the pathlength between two nodes in the is minus a hierarchy An is minus a hierarchydefines specializations (if we travel down the tree) and generalizationsbetween concepts (if we go up)

ndash Wup ndash computes the similarity between two nodes by calculating the pathlength from the least common subsumer (LCS) of the nodes ie the most

4httpwwwnatcorpoxacuk

20

specific node two nodes share in common as an ancestor If one node islsquocatrsquo and the other is lsquodogrsquo the LCS would be lsquomammalrsquo

ndash Resnik ndash takes the LCS of two concepts and computes an estimation of howlsquoinformativersquo the concept really is (its Information Content) A conceptthat occurs frequently will have low Information Content while a conceptthat is quite rare will have a high Information Content

ndash Lin ndash takes the Resnik measure and normalizes it by using also the Infor-mation Content of the two nodes received as input

ndash Jcn ndash uses the same information as Lin but combined in a different way

Kozareva Vazquez and Montoyo [16] also recognize the importance of havingsemantic knowledge in Textual Entailment Recognition (RTE) methods iemethods to decide if one text can be entailed in another and vice versa ndash inother words paraphrase identification Moreover they present a new model inorder to detect semantic relations between word pairs from different syntacticcategories Comparisons should not be only limited between two verbs twonouns or two adjectives but also between a verb and a noun (lsquoassassinatedrsquovs lsquoassassinationrsquo) or an adjective and a noun (lsquohappyrsquo vs rsquohappinessrsquo) forinstance They measure similarity using variations of features such as

ndash Sentence Expansion ndash expanding nouns verbs adjectives and adverbsto their synonyms antonyms and verbs that precedeentail the verbs inquestion Word sense disambiguation is used to reduce the potential noiseof such expansion

ndash Latent Semantic Analysis (LSA) ndash by constructing a term-domain matrixinstead of a term-document one in which the domain is extracted fromthe WordNet domain resource

ndash Cosine measure ndash by using Relevant Domains instead of word frequency

Fernando and Stevenson [8] also considered that a purely lexical similarityapproach such as the one conducted by Zhang and Patrick was quite limitedsince it failed to detect words with the same meaning and therefore proposeda semantic similarity approach based on similarity information derived fromWordNet and using the concept of matrix similarity first developed by Steven-son and Greenwood in 2005 [41]

Each input sentence is mapped into a binary vector with the elements equalto 1 if a word is present and 0 otherwise For a pair of sentences S1 and S2 wehave therefore two vectors ~a and ~b and the similarity is calculated using thefollowing formula

sim(~a~b) =~aW~b

~a ~b (7)

21

with W being a semantic similarity matrix containing the information aboutthe similarity level of a word pair ndash in other words a number between 0 and1 for each Wij representing the similarity level of two words one from eachsentence Six different WordNet metrics were used (the already mentioned LchLesk Wup Resnik Lin and Jcn) five of them with an accuracy score higherthan either Mihalcea Qiu or Zhang and five with also a better precision ratingthan those three metrics

More recently Lintean and Rus [23] identify paraphrases of sentences usingweighted dependences and word semantics taking into account both the simi-larity and the dissimilarity between two sentences (S1 and S2) and then usingthem to compute a final paraphrase score as the ratio of the similarity anddissimilarity scores

Paraphrase(S1S2) = Sim(S1 S2)Diss(S1 S2) (8)

If this score is above a certain threshold T the sentences are deemed para-phrases otherwise they are not considered paraphrases The threshold isobtained by optimizing the performance of the approach on training dataThey used word-to-word similarity metrics based on statistical information fromWordNet which can potentially identify semantically related words even if theyare not exactly synonyms (they give the example of the words lsquosayrsquo and lsquoinsistrsquo)These similarity metrics are obtained through weighted dependency (syntactic)relationships [10] built between two words in a sentence a head and a modifierThe weighting is done considering two factors the typelabel of the dependencyand the depth of a dependency in the dependency tree Minipar [21] and theStanford parser [4] were used to extract dependency information To computethe scores they first map the input sentences into sets of dependencies detectthe common and non-common dependencies between the two and then theycalculate the Sim(S1S2) and Diss(S1S2) parameters If two sentences refer tothe same action but one is substantially longer than the other (ie it adds muchmore information) then there is no two-way entailment and therefore the twosentences are non-paraphrases Lintean and Rus decided to add a constant value(the value 14) when the set of unpaired dependencies of the longer sentence hasmore than 6 extra dependencies than the set of unpaired dependencies of theshorter one The duo managed to improve Qiursquos approach by using word seman-tics to compute similarities and by using dependency types and the position inthe dependency tree to weight dependencies instead of simply using unlabeledand unweighted relations

23 Overview

Thanks to the work initiated by Webber et al we can now define four differenttypes of relations between candidate answers equivalence inclusion contradic-tion and alternative Our focus will be on detecting relations of equivalence

22

expanding the two metrics already available in JustAsk (Levenshtein distanceand overlap distance)

We saw many different types of techniques over the last pages and manymore have been developed over the last ten years presenting variations of theideas which have been described such as building dependency trees or usingsemantical metrics dealing with the notions such as synonymy antonymy andhyponymy For the purpose of synthesis the work here represents what wefound with the most potential of being successfully tested on our system

There are two basic lines of work here the use of lexical-based featuressuch as the already implemented Levenshtein distance and the use of semanticinformation (from WordNet) From these there were several interesting featuresattached such as the transformation of units into canonical forms the potentialuse of context surrounding the answers we extract as candidates dependencytrees similarity matrixes and the introduction of a dissimilarity score in theformula of paraphrase identification

For all of these techniques we also saw different types of corpus for eval-uation The most prominent and most widely available is MSR ParaphraseCorpus which was criticized by Zhang and Patrick for being composed of toomany lexically similar sentences Other corpus used was Knight and Marcucorpus (KMC) composed of 1087 sentence pairs Since these two corpus lacknegative pairs ie pairs of non-paraphrases Cordeiro et al decided to createthree new corpus using a mixture of these two with negative pairs in order tobalance the number of positive pairs and negative ones Finally British NationalCorpus (BNC) was used for the semantic approach of Mihalcea et al

23

3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39] a QA system developedat INESC-ID5 that combines rule-based components with machine learning-based ones Over this chapter we will describe the architecture of this systemfocusing on the answers module and show the results already attained by thesystem in general and by the Answer Extraction module We then detail whatwe plan to do to improve this module

31 JustAsk

This section describes what has been done so far with JustAsk As saidbefore the systemrsquos architecture follows the standard QA architecture withthree stages Question Interpretation Passage Retrieval and Answer Extrac-tion The picture below shows the basic architecture of JustAsk We will workover the Answer Extraction module dealing with relationships between answersand paraphrase identification

Figure 1 JustAsk architecture mirroring the architecture of a general QAsystem

311 Question Interpretation

This module receives as input a natural language question and outputs aninterpreted question In this stage the idea is to gather important information

5httpwwwinesc-idpt

24

about a question (question analysis) and classify it under a category that shouldrepresent the type of answer we look for (question classification)

Question analysis includes tasks such as transforming the question into a setof tokens part-of-speech tagging and identification of a questions headwordie a word in the given question that represents the information we are lookingfor Question Classification is also a subtask in this module In JustAsk Ques-tion Classification uses Li and Rothrsquos [20] taxonomy with 6 main categories(Abbreviation Description Entity Human Location and Numeric)plus 50 more detailed classes The framework allows the usage of different clas-sification techniques hand-built rules and machine-learning JustAsk alreadyattains state of the art results in question classification [39]

312 Passage Retrieval

The passage retrieval module receives as input the interpreted question fromthe preceding module The output is a set of relevant passages ie extractsfrom a search corpus that contain key terms of the question and are thereforemost likely to contain a valid answer The module is composed by a group ofquery formulators and searchers The job here is to associate query formulatorsand question categories to search types Factoid-type questions such as lsquoWhenwas Mozart bornrsquo use Google or Yahoo search engines while definition questionslike lsquoWhat is a parrotrsquo use Wikipedia

As for query formulation JustAsk employs multiple strategies accordingto the question category The most simple is keyword-based strategy appliedfor factoid-type questions which consists in generating a query containing allthe words of a given question Another strategy is based on lexico-syntaticpatterns to learn how to rewrite a question in order to get us closer to theactual answer Wikipedia and DBpedia are used in a combined strategy usingWikipediarsquos search facilities to locate the title of the article where the answermight be found and DBpedia to retrieve the abstract (the first paragraph) ofthe article which is returned as the answer

313 Answer Extraction

The answer extraction module receives an interpreted question and a set ofrelevant passages as input and outputs the final answer There are two mainstages here Candidate Answer Extraction and Answer Selection The firstdeals with extracting the candidate answers and the latter with normalizationclustering and filtering the candidate answers to give us a correct final answer

JustAsk uses regular expressions (for Numeric type questions) named en-tities (able to recognize four entity types Person Location Organiza-tion and Miscellaneous) WordNet based recognizers (hyponymy relations)and Gazeteers (for the categories LocationCountry and LocationCity)to extract plausible answers from relevant passages Currently only answers oftype NumericCount and NumericDate are normalized Normalization isimportant to diminish the answers variation converting equivalent answers to

25

a canonical representation For example the dates lsquoMarch 21 1961rsquo and lsquo21March 1961rsquo will be transformed into lsquoD21 M3 Y1961rsquo

Similar instances are then grouped into a set of clusters and for that pur-pose distance metrics are used namely overlap distance and Levenshtein [19]distance The overlap distance is defined as

Overlap(XY ) = 1minus |X⋂Y |(min(|X| |Y |) (9)

where |X⋂Y | is the number of tokens shared by candidate answers X and Y and

min(|X| |Y |) the size of the smallest candidate answer being compared Thisformula returns 00 for candidate answers that are either equal or containedin the other As it was mentioned in Section 313 the Levenshtein distancecomputes the minimum number of operations in order to transform one stringto another These two distance metrics are used with a standard single-linkclustering algorithm in which the distance between two clusters is considered tobe the minimum of the distances between any member of the clusters Initiallyevery candidate answer has a cluster of its own and then for each step the twoclosest clusters are merged if they are under a certain threshold distance Thealgorithm halts when the distances of the remaining clusters are higher thanthe threshold Each cluster has a representative ndash the clusters longest answerand a score is assigned to the cluster which is simply the sum of the scoresof each candidate answer With the exception of candidate answers extractedusing regular expressions the score of a candidate answer is always 10 meaningthe clusters score will in many cases be directly influenced by its length

32 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system

321 Evaluation measures

To evaluate the results of JustAsk the following measures suited to evaluateQA systems [43] are used

Precision

Precision =Correctly judged items

Relevant (nonminus null) items retrieved(10)

Recall

Recall =Relevant (nonminus null) items retrieved

Items in test corpus(11)

F-measure (F1)

F = 2 lowast precision lowast recallprecision+ recall

(12)

Accuracy

Accuracy =Correctly judged items

Items in test corpus(13)

26

Mean Reciprocal Rank (MRR) used to evaluate systems that pro-duce a list of possible responses to a query ordered by probability ofcorrectness The reciprocal rank of a query response is the multiplicativeinverse of the rank of the first correct answer The mean reciprocal rankis the average of the reciprocal ranks of results for a sample of queries Qin a test corpus For a test collection of N queries the mean reciprocalrank is given by

MRR =1

N

Nsumi=1

1

ranki (14)

322 Multisix corpus

To know for sure that a certain system works we have to evaluate it In thecase of QA systems and with JustAsk in particular we evaluate it giving itquestions as input and then we see if the answers the system outputs matchthe ones we found for ourselves

We began to use the Multisix corpus [24] It is a collection of 200 Englishquestions covering many distinct domains composed for the cross-language tasksat CLEF QA-2003 The corpus has also the correct answer for each questionsearched in the Los Angeles Times over 1994 We call references to the setof correct answers for each of the 200 questions A manual validation of thecorpus was made adding to the references many variations of the answers foundover the web sources (such as adding lsquoImola Italyrsquo lsquoImolarsquo and lsquoItalyrsquo to thequestion lsquoWhere did Ayrton Senna have the accident that caused his deathrsquo)In the context of this thesis besides this manual validation an application wasdeveloped to create the test corpus of the references in a more efficient manner

The 200 questions were also manually classified according to Li and Rothrsquosquestion type taxonomy A few answers were updated since the question variedin time (one example is the question lsquoWho is the coach of Indianarsquo) or eveneliminated (10 of the 200 to be more accurate) since they no longer made sense(like asking who is the vice-president of a company that no longer exists or howold is a person that has already died) And since the Multisix corpus lacked onDescriptionDescription and HumanDefinition questions 12 questionswere gathered and used from the Multieight corpus [25]

For each of the 200 questions we have a set of references ndash possible correctanswers ndash initially being updated manually from the first 64 passages that wefound over at each of the two main search engines (Google and Yahoo) ndash whichwe call lsquosnippetsrsquo This process is quite exhaustive itself The corpus file with allthe results that the GoogleYahoo query produces is huge (a possible maximumof 64200 entries) and to review it takes a considerable amount of time andpatience Moreover we do not need all the information in it to add a referenceTaking for instance this entry

1 Question=What is the capital of Somalia

URL= httpenwikipediaorgwikiMogadishu Title=Mogadishu- Wikipedia the free encyclopedia Content=Mogadishu is the largest

27

city in Somalia and the nationrsquos capital Located in the coastalBenadir region on the Indian Ocean the city has served as an Rank=1 Structured=false

The attributes lsquoURLrsquo lsquoTitlersquo lsquoRankrsquo and lsquoStructuredrsquo are irrelevant for themere process of adding a single (or multiple) references The only attribute thatreally matters to us here is lsquoContentrsquo ndash our snippet (or relevant passage)

An application was made in order to automate some of the process involvedin adding references to a given question ndash creating the corpus containing everyreference captured for all the questions This application receives two files asarguments one with a list of questions and their respective references and thecorpus file containing the corpus with a stipulated maximum of 64 snippets perquestion The script runs the list of questions one by one and for each one ofthem presents all the snippets one by one removing lsquounnecessaryrsquo content likelsquoURLrsquo or lsquoTitlersquo requesting an input from the command line

Figure 2 How the software to create the lsquoanswers corpusrsquo works

This input is the reference one picks from a given snippet So if the contentis lsquoMogadishu is the largest capital of Somalia rsquo if we insert lsquoMogadishursquothe system recognizes it as a true reference ndash since it is included in the contentsection and adds it to the lsquoreferences listrsquo which is nothing more than an arrayof arrays Of course this means any word present in the content is a viablereference and therefore we can write any word in the content which has norelation whatsoever to the actual answer But it helps one finding and writingdown correct answers instead of a completely manual approach like we didbefore writing the application In this actual version the script prints thestatus of the reference list to an output file as it is being filled question byquestion by the final snippet of each question Figure 2 illustrates the process

28

while Figure 3 presents a snap shot of the application running This softwarehas also the advantage of potentially construct new corpora in the future

Figure 3 Execution of the application to create answers corpora in the commandline

29

323 TREC corpus

In an effort to use a more extended and lsquostaticrsquo corpus we decided to useadditional corpora to test JustAsk Therefore in addition to Google and Yahoopassages we now have also 20 years of New York Times editions from 1987 to2007 and in addition to the 200 test questions we found a different set at TRECwhich provided a set of 1768 questions from which we filtered to extract thequestions we knew the New York Times had a correct answer 1134 questionsremained after this filtering stage

TREC questions

The TREC data is composed by a set of questions and also by a set ofcorrectincorrect judgements from different sources according to each question(such as the New York Times online editions) ndash answers that TREC had judgedas correct or incorrect We therefore know if a source manages to answer cor-rectly to a question and what is the lsquocorrectrsquo answer it gives A small applicationwas created using the file with all the 1768 questions and the file containing allthe correct judgements as input parameters in order to filter only the questionswe know the New York Times managed to retrieve a correct answer

After experimenting with the 1134 questions we noticed that many of thesequestions are actually interconnected (some even losing an original referenceafter this filtering step) and therefore losing precious entity information byusing pronouns for example which could explain in part the underwhelmingresults we were obtaining with them Questions such as lsquoWhen did she diersquoor lsquoWho was his fatherrsquo remain clearly unanswered due to this lost referenceWe simply do not know for now who lsquohersquo or lsquoshersquo is It makes us questionldquoHow much can we improve something that has no clear references because thequestion itself is referencing the previous one and so forthrdquo

The first and most simple solution was to filter the questions that containedunknown third-party references ndash pronouns such as lsquohersquo lsquoshersquo lsquoitrsquo lsquoherrsquo lsquohimrsquolsquohersrsquo or lsquohisrsquo plus a couple of questions we easily detected as driving the systemto clear and obvious dead ends such as lsquoHow many are there nowrsquo A total of232 questions without these features were collected and then tested

But there were still much less obvious cases present Here is a blatant ex-ample of how we were still over the same problem

question 1129  Q: Where was the festival held?
question 1130  Q: What is the name of the artistic director of the festival?
question 1131  Q: When was the festival held?
question 1132  Q: Which actress appeared in two films shown at the festival?
question 1133  Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134  Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historical context (which Sundance Festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and therefore was composed of a set of questions for each topic. Since each of these questions from the TREC corpus carries topic information that is often not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and by fixing a few mistakes in the application that filters the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced it to 1210 questions.

New York Times corpus

The 20 years of New York Times editions were indexed using Lucene as the search tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as they do in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/sentence, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results but, unfortunately, not as much as we could hope for, as we will see over the next section.
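The paragraph-level indexing strategy can be sketched as follows. This is an illustration using a recent Lucene API; the field names and the paragraph-splitting heuristic are assumptions, not JustAsk's actual indexing code.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.FSDirectory;

    // Sketch: index each paragraph as its own retrievable unit instead of the whole article.
    public class ParagraphIndexer {
        public static IndexWriter openWriter(String indexDir) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            return new IndexWriter(FSDirectory.open(Paths.get(indexDir)), config);
        }

        public static void indexArticle(IndexWriter writer, String articleId, String articleText)
                throws Exception {
            for (String paragraph : articleText.split("\\n\\s*\\n")) {   // blank-line split (assumption)
                if (paragraph.trim().isEmpty()) continue;
                Document doc = new Document();
                doc.add(new StringField("articleId", articleId, Field.Store.YES));
                doc.add(new TextField("content", paragraph, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }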

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer was extracted, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.

                     Google – 8 passages            Google – 32 passages
Extracted Answers    Questions  CAE     AS          Questions  CAE     AS
                     (#)        success success     (#)        success success
0                    21         0       –           19         0       –
1 to 10              73         62      53          16         11      11
11 to 20             32         29      22          13         9       7
21 to 40             11         10      8           61         57      39
41 to 60             1          1       1           30         25      17
61 to 100            0          –       –           18         14      11
> 100                0          –       –           5          4       3
All                  138        102     84          162        120     88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful when 11 to 20 answers are extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

                     Google – 64 passages
Extracted Answers    Questions (#)   CAE success   AS success
0                    17              0             0
1 to 10              18              9             9
11 to 20             2               2             2
21 to 40             16              12            8
41 to 60             42              38            25
61 to 100            38              32            18
> 100                35              30            17
All                  168             123           79

Table 3: Answer extraction intrinsic evaluation using 64 passages from the search engine Google.

                 Google – 8 passages    Google – 32 passages    Google – 64 passages
Category  Total  Q     CAE    AS        Q     CAE    AS         Q     CAE    AS
abb       4      3     1      1         3     1      1          4     1      1
des       9      6     4      4         6     4      4          6     4      4
ent       21     15    1      1         16    1      0          17    1      0
hum       64     46    42     34        55    48     36         56    48     31
loc       39     27    24     19        34    27     22         35    29     21
num       63     41    30     25        48    35     25         50    40     22
All       200    138   102    84        162   120    88         168   123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph-indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.

The 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

              1134 questions   232 questions   1210 questions
Precision     26.7%            28.2%           27.4%
Recall        77.2%            67.2%           76.9%
Accuracy      20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of questions that were created using TREC questions and the New York Times corpus.

Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages      20       30       50       100
Accuracy      21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, that run also takes more than three times as long as running with just 20 passages, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to group equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted, '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals the following:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'


– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words, and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.


3. Places/dates/numbers that differ in specificity – some candidates in the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviations normalization to include this type of abbreviation.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

In this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), through the implementation of the set of rules that compose our 'relations module', new normalizers and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existent metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the purpose of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than relations of equivalence for a certain category.

The work in [26] taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk that detects relations between answers and then uses them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This also addresses point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to – and a score for each relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for any of the corpora.
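A minimal sketch of this scoring scheme is given below; the class names and the way relations are attached to answers are simplified stand-ins for JustAsk's internals, not the actual implementation.

    import java.util.*;

    // Sketch of the relation-based boosting: 1.0 for frequency/equivalence, 0.5 per inclusion.
    enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    class Relation {
        final RelationType type;
        final String otherAnswer;
        Relation(RelationType type, String otherAnswer) {
            this.type = type;
            this.otherAnswer = otherAnswer;
        }
    }

    class RelationScorer {
        static double score(String answer, List<String> allExtractedAnswers, List<Relation> relations) {
            double score = 0.0;
            for (String other : allExtractedAnswers) {
                if (other.equalsIgnoreCase(answer)) score += 1.0;   // frequency
            }
            for (Relation r : relations) {
                switch (r.type) {
                    case EQUIVALENCE: score += 1.0; break;          // equivalence
                    case INCLUDES:
                    case INCLUDED_BY: score += 0.5; break;          // inclusion (either direction)
                }
            }
            return score;
        }
    }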


Figure 5: The relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
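A hedged sketch of such a converter is shown below; the method names and the parsing regex are illustrative (only the kilometre case is spelled out), not JustAsk's actual normalizer.

    // Sketch: convert supported units to one canonical unit per type.
    public class MeasureNormalizer {
        public static double kmToMiles(double km)          { return km * 0.621371; }
        public static double metersToFeet(double m)        { return m * 3.28084; }
        public static double kgToPounds(double kg)         { return kg * 2.20462; }
        public static double celsiusToFahrenheit(double c) { return c * 9.0 / 5.0 + 32.0; }

        // e.g. "300 km" -> "186.4 miles", so it can be clustered with answers given in miles
        public static String normalize(String answer) {
            java.util.regex.Matcher m = java.util.regex.Pattern
                    .compile("([\\d.]+)\\s*(km|kilometres?|kilometers?)")
                    .matcher(answer);
            if (m.find()) {
                double miles = kmToMiles(Double.parseDouble(m.group(1)));
                return String.format("%.1f miles", miles);
            }
            return answer;   // other unit types handled analogously
        }
    }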

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we found a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.
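The sketch below illustrates this normalizer together with an exception list for the cases just discussed; the class name and the exception entries are illustrative assumptions.

    import java.util.*;
    import java.util.regex.*;

    // Sketch: strip leading titles/abbreviations, keeping known exceptions untouched.
    public class TitleNormalizer {
        private static final Pattern TITLES =
                Pattern.compile("^(Mr|Mrs|Ms|King|Queen|Dame|Sir|Dr)\\.?\\s+", Pattern.CASE_INSENSITIVE);
        private static final Set<String> EXCEPTIONS =
                new HashSet<>(Arrays.asList("Mr. T", "Queen"));   // hypothetical exception list

        public static String normalize(String answer) {
            if (EXCEPTIONS.contains(answer)) {
                return answer;                      // do not touch known exceptions
            }
            return TITLES.matcher(answer).replaceFirst("");   // "Mr. Kennedy" -> "Kennedy"
        }
    }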

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results; it also only covers very particular cases for the moment, allowing room for further extensions.

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that appear closer to the query keywords present in the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
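A minimal sketch of this proximity boost is given below, following the values stated above (20-character threshold, 5-point boost); the way positions are obtained is simplified with respect to the real passages.

    // Sketch: boost a candidate's score when it sits close to the query keywords in the passage.
    public class ProximityScorer {
        static final int DISTANCE_THRESHOLD = 20;   // characters
        static final double BOOST = 5.0;            // points

        public static double boost(String passage, String answer, String[] keywords, double score) {
            int answerPos = passage.indexOf(answer);
            if (answerPos < 0) return score;
            double total = 0;
            int found = 0;
            for (String kw : keywords) {
                int kwPos = passage.indexOf(kw);
                if (kwPos >= 0) {
                    total += Math.abs(kwPos - answerPos);
                    found++;
                }
            }
            if (found > 0 && (total / found) <= DISTANCE_THRESHOLD) {
                return score + BOOST;
            }
            return score;
        }
    }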


4.3 Results

                      200 questions                1210 questions
Before features       P 50.6   R 87.0   A 44.0     P 28.0   R 79.5   A 22.2
After features        P 51.1   R 87.0   A 44.5     P 28.2   R 79.8   A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered for the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;


– 'D' for cases of doubt about the relation type.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgement. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADITORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.

Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}    (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across the two sequences.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = \frac{2|A \cap B|}{|A| + |B|}    (16)

Dice's coefficient basically gives double weight to the intersecting elements, which raises the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated from the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five elements: Ke, en, ne, ed, dy. The result would therefore be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (five shared bigrams out of seven distinct bigrams in the union).
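The worked example can be reproduced with a few lines of Java; the class below is a self-contained illustration, not part of the SimMetrics package.

    import java.util.*;

    // Character bigrams of "Kennedy" and "Keneddy" and the resulting Dice/Jaccard scores.
    public class BigramSimilarityExample {
        static Set<String> bigrams(String s) {
            Set<String> grams = new LinkedHashSet<>();
            for (int i = 0; i < s.length() - 1; i++) grams.add(s.substring(i, i + 2));
            return grams;
        }

        public static void main(String[] args) {
            Set<String> a = bigrams("Kennedy");   // {Ke, en, nn, ne, ed, dy}
            Set<String> b = bigrams("Keneddy");   // {Ke, en, ne, ed, dd, dy}

            Set<String> intersection = new HashSet<>(a);
            intersection.retainAll(b);            // {Ke, en, ne, ed, dy}
            Set<String> union = new HashSet<>(a);
            union.addAll(b);                      // 7 distinct bigrams

            double dice = 2.0 * intersection.size() / (a.size() + b.size()); // 10/12
            double jaccard = (double) intersection.size() / union.size();    // 5/7
            System.out.printf("dice = %.2f, jaccard = %.2f%n", dice, jaccard);
        }
    }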


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)    (17)

in which s_1 and s_2 are the two sequences to be compared, |s_1| and |s_2| are their lengths, m is the number of matching characters in s_1 and s_2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

d_w = d_j + l \cdot p \cdot (1 - d_j)    (18)

with d_j being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but with a variable gap cost G (instead of Levenshtein's constant cost of 1):

D(i, j) = \min \begin{cases} D(i-1, j-1) + G \\ D(i-1, j) + G \\ D(i, j-1) + G \end{cases}    (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words that sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is above the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even to match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at each sequence in its entirety, and chooses the one that maximizes the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.


Then the LogSim metric proposed by Cordeiro and his team, described in the Related Work (Section 2.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
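The sketch below illustrates how one of these SimMetrics measures can be wrapped as a distance and used by a threshold-based single-link clusterer. The getSimilarity call and class names follow the SimMetrics library (exact names may vary by version), while the clustering loop is a simplified illustration rather than JustAsk's actual clusterer.

    import java.util.*;
    import uk.ac.shef.wit.simmetrics.similaritymetrics.AbstractStringMetric;
    import uk.ac.shef.wit.simmetrics.similaritymetrics.JaroWinkler;

    public class MetricClusterer {
        private final AbstractStringMetric metric;
        private final double threshold;   // e.g. 0.17, 0.33 or 0.5

        public MetricClusterer(AbstractStringMetric metric, double threshold) {
            this.metric = metric;
            this.threshold = threshold;
        }

        // Similarity in [0,1] turned into a distance.
        double distance(String a, String b) {
            return 1.0 - metric.getSimilarity(a, b);
        }

        // Single-link: an answer joins a cluster if it is close enough to ANY of its members.
        public List<List<String>> cluster(List<String> answers) {
            List<List<String>> clusters = new ArrayList<>();
            for (String answer : answers) {
                List<String> target = null;
                for (List<String> c : clusters) {
                    for (String member : c) {
                        if (distance(answer, member) <= threshold) { target = c; break; }
                    }
                    if (target != null) break;
                }
                if (target == null) { target = new ArrayList<>(); clusters.add(target); }
                target.add(answer);
            }
            return clusters;
        }

        public static void main(String[] args) {
            MetricClusterer c = new MetricClusterer(new JaroWinkler(), 0.17);
            System.out.println(c.cluster(Arrays.asList("Mount Kilimanjaro", "Mt Kilimanjaro", "Glomma")));
        }
    }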

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


                     Threshold 0.17          Threshold 0.33          Threshold 0.5
Metric               P     R     A           P     R     A           P     R     A
Levenshtein          51.1  87.0  44.5        43.7  87.5  38.5        37.3  84.5  31.5
Overlap              37.9  87.0  33.0        33.0  87.0  33.0        36.4  86.5  31.5
Jaccard index        9.8   56.0  5.5         9.9   55.5  5.5         9.9   55.5  5.5
Dice Coefficient     9.8   56.0  5.5         9.9   56.0  5.5         9.9   55.5  5.5
Jaro                 11.0  63.5  7.0         11.9  63.0  7.5         9.8   56.0  5.5
Jaro-Winkler         11.0  63.5  7.0         11.9  63.0  7.5         9.8   56.0  5.5
Needleman-Wunch      43.7  87.0  38.0        43.7  87.0  38.0        16.9  59.0  10.0
LogSim               43.7  87.0  38.0        43.7  87.0  38.0        42.0  87.0  36.5
Soundex              35.9  83.5  30.0        35.9  83.5  30.0        28.7  82.0  23.5
Smith-Waterman       17.3  63.5  11.0        13.4  59.5  8.0         10.4  57.5  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (%) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are obtained with Levenshtein at threshold 0.17.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it additionally gives extra weight to strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))}    (21)

and

firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2}    (22)

with a_1 and a_2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
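A possible implementation of this measure, following equations (20)–(22), is sketched below; tokenization and normalization are simplified with respect to JustAsk's actual code.

    import java.util.*;

    // Sketch of the JFKDistance computation described by equations (20)-(22).
    public class JFKDistance {
        public static double distance(String answer1, String answer2) {
            List<String> t1 = new ArrayList<>(Arrays.asList(answer1.toLowerCase().split("\\s+")));
            List<String> t2 = new ArrayList<>(Arrays.asList(answer2.toLowerCase().split("\\s+")));
            int maxLength = Math.max(t1.size(), t2.size());

            // Count and remove perfectly matching tokens ("Kennedy" in the example).
            int commonTerms = 0;
            for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
                if (t2.remove(it.next())) { it.remove(); commonTerms++; }
            }

            // Among the remaining (non-common) tokens, count first-letter matches
            // such as "J."/"John" and "F."/"Fitzgerald", each consumed at most once.
            int nonCommonTerms = t1.size() + t2.size();
            int firstLetterMatches = 0;
            for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
                char first = it.next().charAt(0);
                for (Iterator<String> jt = t2.iterator(); jt.hasNext(); ) {
                    if (jt.next().charAt(0) == first) {
                        jt.remove(); it.remove(); firstLetterMatches++; break;
                    }
                }
            }

            double commonTermsScore = (double) commonTerms / maxLength;
            double firstLetterMatchesScore = nonCommonTerms == 0
                    ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;
            return 1.0 - commonTermsScore - firstLetterMatchesScore;
        }

        public static void main(String[] args) {
            // 1 common term, 2 first-letter matches out of 4 non-common terms
            System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
        }
    }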

A few flaws of this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows us the results after the implementation of this new distance metric.

                      200 questions                1210 questions
Before                P 51.1   R 87.0   A 44.5     P 28.2   R 79.8   A 22.5
After new metric      P 51.7   R 87.0   A 45.0     P 29.2   R 79.8   A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric that offered the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are looking for), Table 10 shows the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, whose answers are retrieved from the Wikipedia abstract, are just marked as wrong/nil given the reference input).

Metric               Human     Location   Numeric
Levenshtein          50.00     53.85      4.84
Overlap              42.42     41.03      4.84
Jaccard index        7.58      0.00       0.00
Dice                 7.58      0.00       0.00
Jaro                 9.09      0.00       0.00
Jaro-Winkler         9.09      0.00       0.00
Needleman-Wunch      42.42     46.15      4.84
LogSim               42.42     46.15      4.84
Soundex              42.42     46.15      3.23
Smith-Waterman       9.09      2.56       0.00
JFKDistance          54.55     53.85      4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. The best results are obtained with JFKDistance, which matches or surpasses every other metric in all three categories.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single, static answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than the current prime-minister, Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected Google corpus (in this case, 2010).

– Numeric answers easily fail due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to: a) ignore 'one' to 'three' when written in words; b) somehow exclude the publishing date from the snippet; and c) improve the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we list the multiple answers as separate single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error detected lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art study on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, the results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could tackle, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Languages, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.


[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.


[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.


[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is the hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001


1.2 Questions of category Numeric

Given

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX * max

• numXmin = numX * min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "()\bnumber( )") and matchesic(Aj, "()\bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bmorethan number( )") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\blessthan number( )") and (numAi > numAj)
  – Example: less than 300 includes 200
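As an illustration only, the numeric equivalence rule with the [min, max] tolerance above could be coded as follows; the number-parsing regex is simplified and hypothetical, not the one used by the rules.

    import java.util.regex.*;

    // Two numeric answers are equivalent when their values fall within the 0.9-1.1 tolerance band.
    public class NumericRules {
        static final double MAX = 1.1, MIN = 0.9;
        static final Pattern NUMBER = Pattern.compile("(\\d+(\\.\\d+)?)");

        static Double numericValue(String answer) {
            Matcher m = NUMBER.matcher(answer);
            return m.find() ? Double.parseDouble(m.group(1)) : null;
        }

        public static boolean equivalent(String ai, String aj) {
            Double numAi = numericValue(ai), numAj = numericValue(aj);
            if (numAi == null || numAj == null) return false;
            return numAi * MAX > numAj && numAi * MIN < numAj;
        }

        public static void main(String[] args) {
            System.out.println(equivalent("210 people", "200 people"));   // true
            System.out.println(equivalent("300 km", "200 km"));           // false
        }
    }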

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr. Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to F. Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr. Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey


In 2007, Cordeiro and his team [5] analyzed previous metrics, namely the Levenshtein distance and the Word N-Gram overlap family of similarity measures [3, 32], and concluded that these metrics were not enough to identify paraphrases, leading to wrong judgements and low accuracy scores. The Levenshtein distance and an overlap distance that measures how many words two sentences share (see Section 3.1.3) are, ironically, the two metrics that the answer extraction module of JustAsk possesses at the moment, so this is a good starting point to explore other options. As an alternative to these metrics, Cordeiro's team proposed a new measure called LogSim. The idea is, for a pair of input sentences, to count the exclusive lexical links between them. For instance, in these two sentences:

S1: Mary had a very nice dinner for her birthday.
S2: It was a successful dinner party for Mary on her birthday.

there are four links ('dinner', 'Mary', 'her', 'birthday'), i.e., four common words between S1 and S2. The function in use is:

L(x, \lambda) = -\log_2\left(\frac{\lambda}{x}\right) \qquad (2)

where x is the number of words of the longest sentence and λ is the number of links between S1 and S2 (with the logarithm acting as a slow penalty for sentences that are too similar). The complete formula is written below:

logSim(S_1, S_2) =
\begin{cases}
L(x, \lambda) & \text{if } L(x, \lambda) < 1.0 \\
e^{-k \cdot L(x, \lambda)} & \text{otherwise}
\end{cases} \qquad (3)

This ensures that the similarity only varies between 0 and 1. The idea behind the second branch is to penalize pairs with great dissimilarities, boosted by a constant k (set to 2.7 in their experiments), as λ/x tends to 0 and therefore L(x, λ) tends to infinity. However, having a metric depending only on x, and not on y, i.e., the number of words of the shortest sentence, is not a perfect measure – after all, they are not completely independent, since λ ≤ y. Therefore, an expanded version of LogSim was also developed and tested by the same team, now accepting three parameters: λ, x and y.

L(x, y, λ) is computed using weighting factors α and β (β being 1−α), applied to L(x, λ) and L(y, λ), respectively. They used three self-constructed corpora, built from unions of the MSR Corpus, the Knight and Marcu Corpus3 and extra negative pairs (of non-paraphrases), to compensate for the lack of negative pairs offered by the first two corpora. When compared to the Levenshtein and N-gram metrics, LogSim performed substantially better whenever the MSR Paraphrase Corpus was present in the test corpus.

3 This was a corpus used by Kevin Knight and Daniel Marcu, produced manually from pairs of texts and respective summaries, containing 1087 similar sentence pairs, with one sentence being a compressed version of the other.
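To make the formula concrete, here is a minimal Java sketch of LogSim as defined by equations (2) and (3), approximating the exclusive lexical links by the set intersection of lower-cased tokens; the class and method names are ours, not part of any existing library.

import java.util.*;

public class LogSim {
    private static final double K = 2.7; // penalty constant used by Cordeiro et al.

    public static double score(String s1, String s2) {
        List<String> t1 = Arrays.asList(s1.toLowerCase().split("\\s+"));
        List<String> t2 = Arrays.asList(s2.toLowerCase().split("\\s+"));
        Set<String> links = new HashSet<>(t1);
        links.retainAll(new HashSet<>(t2));          // lexical links between S1 and S2
        int lambda = links.size();
        int x = Math.max(t1.size(), t2.size());      // length of the longest sentence
        if (lambda == 0) return 0.0;
        double l = -(Math.log((double) lambda / x) / Math.log(2)); // L(x, lambda)
        return l < 1.0 ? l : Math.exp(-K * l);       // equation (3)
    }
}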


2.2.3 Canonicalization as complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution, converting the pairs of sentences into canonical forms at the lexical and syntactic level – 'generic' versions of the original text – and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization particularly involves number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method and concluding that the MSR has many nominalizations and not enough variety – which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other more 'sophisticated' methodologies, which again supports the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type Numeric:Count and Numeric:Date, so the kind of equivalences involving numeric entities that can be considered answers for certain types of questions is already covered. Passive to active voice and future tense transformations can, however, matter for more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question. Two sentences sharing most of their words are likely to be paraphrases. But these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only the 'easy' paraphrases were identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S_1, S_2) = \cos(S_1, S_2) - \text{mismatch\_penalty} \qquad (4)


with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming formula – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:

s(S_1, S_2) = \max
\begin{cases}
s(S_1, S_2 - 1) - \text{skip\_penalty} \\
s(S_1 - 1, S_2) - \text{skip\_penalty} \\
s(S_1 - 1, S_2 - 1) + sim(S_1, S_2) \\
s(S_1 - 1, S_2 - 2) + sim(S_1, S_2) + sim(S_1, S_2 - 1) \\
s(S_1 - 2, S_2 - 1) + sim(S_1, S_2) + sim(S_1 - 1, S_2) \\
s(S_1 - 2, S_2 - 2) + sim(S_1, S_2 - 1) + sim(S_1 - 1, S_2)
\end{cases} \qquad (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.
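A minimal Java sketch of recurrence (5) follows (our own illustrative code, with a trivial token-overlap cosine standing in for the real sentence similarity of equation (4); the penalty values are arbitrary assumptions).

public class SentenceAligner {
    static final double SKIP_PENALTY = 0.1;      // illustrative values only
    static final double MISMATCH_PENALTY = 0.05;

    // sim(S1, S2) of equation (4): a rough cosine over shared tokens, minus a mismatch penalty.
    static double sim(String[] a, String[] b) {
        int shared = 0;
        for (String x : a) for (String y : b) if (x.equalsIgnoreCase(y)) { shared++; break; }
        double cos = shared / Math.sqrt((double) a.length * b.length);
        return cos - MISMATCH_PENALTY;
    }

    // Optimal alignment weight s(i, j) of equation (5) over the sentences of two paragraphs.
    static double align(String[][] p1, String[][] p2) {
        int n = p1.length, m = p2.length;
        double[][] s = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double best = Math.max(s[i][j - 1] - SKIP_PENALTY, s[i - 1][j] - SKIP_PENALTY);
                best = Math.max(best, s[i - 1][j - 1] + sim(p1[i - 1], p2[j - 1]));
                if (j >= 2) best = Math.max(best,
                        s[i - 1][j - 2] + sim(p1[i - 1], p2[j - 1]) + sim(p1[i - 1], p2[j - 2]));
                if (i >= 2) best = Math.max(best,
                        s[i - 2][j - 1] + sim(p1[i - 1], p2[j - 1]) + sim(p1[i - 2], p2[j - 1]));
                if (i >= 2 && j >= 2) best = Math.max(best,
                        s[i - 2][j - 2] + sim(p1[i - 1], p2[j - 2]) + sim(p1[i - 2], p2[j - 1]));
                s[i][j] = best;
            }
        }
        return s[n][m];
    }
}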

Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phased framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are passed through a dissimilarity classifier in order to determine whether we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given


by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus they used the British National Corpus4, a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T_1, T_2) = \frac{1}{2}\left(\frac{\sum_{w \in T_1} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)}\right) \qquad (6)

For each word w in the segment T1, the goal is to identify the word that matches it best in T2, given by maxSim(w, T2), weight it by its specificity via IDF, sum these values and normalize the sum with the length of the text segment in question. The same process is applied to T2, and in the end we generate an average between the two directions (T1 to T2 and vice versa). They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22] and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics.
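To make formula (6) concrete, the following is a minimal Java sketch (our own illustrative code, not Mihalcea's implementation) of the text-to-text score, with the word-to-word similarity and the IDF values passed in as a simple function and map rather than the eight metrics listed below.

import java.util.*;
import java.util.function.BiFunction;

public class TextSimilarity {
    // sim(T1, T2) as in equation (6): each word of one segment is matched with its
    // best counterpart in the other segment, weighted by IDF, then both directions are averaged.
    public static double sim(List<String> t1, List<String> t2,
                             BiFunction<String, String, Double> wordSim,
                             Map<String, Double> idf) {
        return 0.5 * (directional(t1, t2, wordSim, idf) + directional(t2, t1, wordSim, idf));
    }

    private static double directional(List<String> from, List<String> to,
                                       BiFunction<String, String, Double> wordSim,
                                       Map<String, Double> idf) {
        double num = 0.0, den = 0.0;
        for (String w : from) {
            double best = 0.0;
            for (String v : to) best = Math.max(best, wordSim.apply(w, v)); // maxSim(w, to)
            double weight = idf.getOrDefault(w, 1.0);
            num += best * weight;
            den += weight;
        }
        return den == 0.0 ? 0.0 : num / den;
    }
}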

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts and of concepts directly related through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.

– Lin – takes the Resnik measure and normalizes it, using also the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.

4 http://www.natcorp.ox.ac.uk

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource.

– Cosine measure – by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning. They therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with each element equal to 1 if the corresponding word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors a⃗ and b⃗, and the similarity is calculated using the following formula:

sim(\vec{a}, \vec{b}) = \frac{\vec{a}\,W\,\vec{b}}{|\vec{a}|\,|\vec{b}|} \qquad (7)


with W being a semantic similarity matrix containing the information about the similarity level of each word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's, and five also with a better precision rating than those three approaches.

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S_1, S_2) = \frac{Sim(S_1, S_2)}{Diss(S_1, S_2)} \qquad (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise they are not. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting considers two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (14) to the dissimilarity score when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared with the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,


expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these came several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the paraphrase identification formula.

For all of these techniques we also saw different corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two plus negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identifying the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

5 http://www.inesc-id.pt

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, used to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The former deals with extracting the candidate answers, and the latter with normalizing, clustering and filtering the candidate answers in order to return a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 - \frac{|X \cap Y|}{\min(|X|, |Y|)} \qquad (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned before, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any two members of the clusters. Initially every candidate answer has a cluster of its own, and at each step the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances between the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer in it. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly determined by its size.
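To make the clustering step concrete, here is a minimal Java sketch (illustrative only, not JustAsk's actual code) of the overlap distance of equation (9) together with a threshold-based single-link merging loop.

import java.util.*;

public class SingleLinkClusterer {
    // Overlap distance of equation (9), computed over whitespace tokens.
    static double overlap(String a, String b) {
        Set<String> x = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> y = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> shared = new HashSet<>(x);
        shared.retainAll(y);
        return 1.0 - (double) shared.size() / Math.min(x.size(), y.size());
    }

    // Single-link distance between two clusters: the minimum pairwise distance.
    static double linkage(List<String> c1, List<String> c2) {
        double min = Double.MAX_VALUE;
        for (String a : c1) for (String b : c2) min = Math.min(min, overlap(a, b));
        return min;
    }

    // Start with one cluster per answer; keep merging the closest pair under the threshold.
    static List<List<String>> cluster(List<String> answers, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String a : answers) clusters.add(new ArrayList<>(List.of(a)));
        while (true) {
            int bi = -1, bj = -1;
            double best = threshold;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkage(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            if (bi < 0) return clusters;                 // no pair under the threshold: halt
            clusters.get(bi).addAll(clusters.remove(bj)); // merge the two closest clusters
        }
    }
}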

3.2 Experimental Setup

Over this section we describe the corpora used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

\text{Precision} = \frac{\text{Correctly judged items}}{\text{Relevant (non-null) items retrieved}} \qquad (10)

Recall:

\text{Recall} = \frac{\text{Relevant (non-null) items retrieved}}{\text{Items in test corpus}} \qquad (11)

F-measure (F1):

F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (12)

Accuracy:

\text{Accuracy} = \frac{\text{Correctly judged items}}{\text{Items in test corpus}} \qquad (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i} \qquad (14)
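As an illustration, the following minimal Java sketch (illustrative, not JustAsk's evaluation code) computes these measures from a list of per-question judgements, where each judgement records whether the system returned an answer and the rank of the first correct answer (0 if no correct answer was returned).

import java.util.List;

public class Evaluation {
    // answered = false means the system returned no answer at all;
    // rankOfFirstCorrect = 0 means no correct answer was returned for the question.
    public record Judgement(boolean answered, int rankOfFirstCorrect) {}

    public static void report(List<Judgement> judgements) {
        int total = judgements.size();
        int answered = 0, correct = 0;
        double reciprocalSum = 0.0;
        for (Judgement j : judgements) {
            if (j.answered()) answered++;
            if (j.rankOfFirstCorrect() > 0) {
                correct++;
                reciprocalSum += 1.0 / j.rankOfFirstCorrect();
            }
        }
        double precision = (double) correct / answered;            // eq. (10)
        double recall = (double) answered / total;                 // eq. (11)
        double f1 = 2 * precision * recall / (precision + recall); // eq. (12)
        double accuracy = (double) correct / total;                // eq. (13)
        double mrr = reciprocalSum / total;                        // eq. (14)
        System.out.printf("P=%.3f R=%.3f F1=%.3f A=%.3f MRR=%.3f%n",
                precision, recall, f1, accuracy, mrr);
    }
}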

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then seeing if the answers the system outputs match the ones we found ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times editions of 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more precise), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages found with each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhausting in itself: the corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 x 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:

1. Question=What is the capital of Somalia

URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing at most 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title' and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,

Figure 3: Execution of the application to create answers corpora in the command line.


3.2.3 TREC corpus

In an effort to use a larger and more 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions, from which we extracted the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times retrieved a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), therefore losing precious entity information by using pronouns, for example, which could partly explain the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask how much we can improve something that has no clear references, because the question itself references the previous one, and so forth.

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But much less obvious cases were still present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and by fixing a few mistakes in the application that filters the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search. Instead of returning sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/sentence, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not the whole documents but only the paragraph that contains the keywords sent in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions: 200 | Correct: 88 | Wrong: 87 | No Answer: 25

Accuracy: 44.0% | Precision: 50.3% | Recall: 87.5% | MRR: 0.37 | Precision@1: 35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer is extracted, while the success of the AS stage is given by the number of questions for which one final correct answer is selected from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.

Extracted   | Google – 8 passages                | Google – 32 passages
Answers     | Questions | CAE success | AS success | Questions | CAE success | AS success
0           | 21        | 0           | –          | 19        | 0           | –
1 to 10     | 73        | 62          | 53         | 16        | 11          | 11
11 to 20    | 32        | 29          | 22         | 13        | 9           | 7
21 to 40    | 11        | 10          | 8          | 61        | 57          | 39
41 to 60    | 1         | 1           | 1          | 30        | 25          | 17
61 to 100   | 0         | –           | –          | 18        | 14          | 11
> 100       | 0         | –           | –          | 5         | 4           | 3
All         | 138       | 102         | 84         | 162       | 120         | 88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

For the category Entity, the system only extracts one final answer, no


Extracted   | Google – 64 passages
Answers     | Questions | CAE success | AS success
0           | 17        | 0           | 0
1 to 10     | 18        | 9           | 9
11 to 20    | 2         | 2           | 2
21 to 40    | 16        | 12          | 8
41 to 60    | 42        | 38          | 25
61 to 100   | 38        | 32          | 18
> 100       | 35        | 30          | 17
All         | 168       | 123         | 79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

         |       | Google – 8 passages | Google – 32 passages | Google – 64 passages
Category | Total | Q   | CAE | AS      | Q   | CAE | AS         | Q   | CAE | AS
abb      | 4     | 3   | 1   | 1       | 3   | 1   | 1          | 4   | 1   | 1
des      | 9     | 6   | 4   | 4       | 6   | 4   | 4          | 6   | 4   | 4
ent      | 21    | 15  | 1   | 1       | 16  | 1   | 0          | 17  | 1   | 0
hum      | 64    | 46  | 42  | 34      | 55  | 48  | 36         | 56  | 48  | 31
loc      | 39    | 27  | 24  | 19      | 34  | 27  | 22         | 35  | 29  | 21
num      | 63    | 41  | 30  | 25      | 48  | 35  | 25         | 50  | 40  | 22
All      | 200   | 138 | 102 | 84      | 162 | 120 | 88         | 168 | 123 | 79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

matter how many passages are used. We found that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and


was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

          | 1134 questions | 232 questions | 1210 questions
Precision | 26.7%          | 28.2%         | 27.4%
Recall    | 77.2%          | 67.2%         | 76.9%
Accuracy  | 20.6%          | 19.0%         | 21.1%

Table 5: Baseline results for the three question sets created using TREC questions and the New York Times corpus.

Later we carried out further experiments with this corpus by increasing the number of passages to retrieve. We found that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages | 20    | 30    | 50    | 100
Accuracy | 21.1% | 22.4% | 23.7% | 23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, it also takes more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since each of these answers was extracted only once, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'

– 'Namibia has built a water canal measuring about 300 km long ... Siyabona Africa Travel (Pty) Ltd, "Popa Falls – Okavango River – Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search than a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that computes similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59). With the newly proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.
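A minimal Java sketch of the token-level heuristic suggested above follows (illustrative only; the class name, the same-length restriction and the initial check are our own simplifying assumptions): it looks for at least one exact token match and then checks whether the remaining tokens agree at least on their initials.

public class TokenInitialMatcher {
    // Heuristic: two answers refer to the same entity if they share at least one full token
    // and every remaining token pair agrees on its initial, e.g. 'J. F. Kennedy' vs
    // 'John Fitzgerald Kennedy'.
    public static boolean sameEntity(String a, String b) {
        String[] ta = a.split("\\s+");
        String[] tb = b.split("\\s+");
        if (ta.length != tb.length) return false;      // simple version: same number of tokens
        boolean fullMatch = false;
        for (int i = 0; i < ta.length; i++) {
            String x = ta[i].toLowerCase().replace(".", "");
            String y = tb[i].toLowerCase().replace(".", "");
            if (x.equals(y)) { fullMatch = true; continue; }
            if (x.charAt(0) != y.charAt(0)) return false; // initials must agree
        }
        return fullMatch;
    }

    public static void main(String[] args) {
        System.out.println(sameEntity("J. F. Kennedy", "John Fitzgerald Kennedy")); // true
        System.out.println(sameEntity("Bill Clinton", "Bob Dole"));                 // false
    }
}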


3. Places/dates/numbers that differ in specificity – some candidates in the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Oberbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now within JustAsk, hoping for similar improvements.

4. Language issues – many entities are referred to by completely different names for the same entity: we can find the name in the original language of the entity's location or in a normalized English form. Even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to include these acronyms.

Over this section we have looked at several points of improvement. Next we will describe how we addressed each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to


analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After this process is automated and the answering module extended, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] has taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers and, from then on, use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

Each answer now possesses the set of relations it shares with other answers – namely the type of relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes the other, and one for the answer that is included by the other).

For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not differ at all over all the corpora.


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: a Baseline Scorer, to count the frequency, and a Rules Scorer, to count relations of equivalence and inclusion.
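A minimal sketch of how such a scorer could boost candidate scores from detected relations is shown below (the Java class and field names are illustrative, not JustAsk's actual API; the weights follow the 1.0/0.5 values mentioned above).

import java.util.ArrayList;
import java.util.List;

public class RulesScorer {
    enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    record Relation(RelationType type, String otherAnswer) {}

    static class CandidateAnswer {
        String text;
        double score = 1.0;                              // default extraction score
        List<Relation> relations = new ArrayList<>();    // relations detected by the rules
    }

    // Boost each candidate's score according to the relations it shares with other answers.
    static void score(List<CandidateAnswer> candidates) {
        for (CandidateAnswer a : candidates) {
            for (Relation r : a.relations) {
                switch (r.type()) {
                    case EQUIVALENCE -> a.score += 1.0;
                    case INCLUDES, INCLUDED_BY -> a.score += 0.5;
                }
            }
        }
    }
}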


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
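As an illustration of what such a normalizer can look like (a sketch only; the real class names and category handling in JustAsk may differ), the following converts distances expressed in kilometres to miles.

import java.util.Locale;
import java.util.regex.*;

public class UnitNormalizer {
    private static final Pattern KM =
            Pattern.compile("([0-9]+(?:\\.[0-9]+)?)\\s*(km|kilometers|kilometres)", Pattern.CASE_INSENSITIVE);

    // Normalizes distances expressed in kilometres to miles, so that answers such as
    // '300 km' and '186 miles' end up in a comparable form.
    public static String normalizeDistance(String answer) {
        Matcher m = KM.matcher(answer);
        if (m.find()) {
            double km = Double.parseDouble(m.group(1));
            double miles = km * 0.621371;
            return String.format(Locale.US, "%.1f miles", miles);
        }
        return answer; // already in miles (or not a distance at all)
    }

    public static void main(String[] args) {
        System.out.println(normalizeDistance("about 300 km")); // "186.4 miles"
    }
}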

Another normalizer was created to remove several common abbreviations and titles from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The removed tokens were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, leaving room for further extensions.

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points).
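A possible sketch of this proximity boost follows (illustrative Java code; the character-offset bookkeeping inside JustAsk is certainly different).

import java.util.List;

public class ProximityScorer {
    private static final double THRESHOLD = 20.0; // average distance, in characters
    private static final double BOOST = 5.0;      // score boost applied under the threshold

    // Boosts the answer score when the answer occurs, on average, close to the query keywords
    // inside the passage it was extracted from.
    public static double score(String passage, String answer, List<String> keywords, double baseScore) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0) return baseScore;
        double total = 0;
        int found = 0;
        for (String kw : keywords) {
            int kwPos = passage.indexOf(kw);
            if (kwPos >= 0) {
                total += Math.abs(kwPos - answerPos);
                found++;
            }
        }
        if (found == 0) return baseScore;
        double avg = total / found;
        return avg < THRESHOLD ? baseScore + BOOST : baseScore;
    }
}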


4.3 Results

                 | 200 questions                   | 1210 questions
Before features  | P: 50.6%  R: 87.0%  A: 44.0%    | P: 28.0%  R: 79.5%  A: 22.2%
After features   | P: 51.1%  R: 87.0%  A: 44.5%    | P: 28.2%  R: 79.8%  A: 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

44 Results Analysis

These preliminary results remain underwhelming taking into account theeffort and hope that was put into the relations module (Section 41) in particular

The variation in results is minimal: only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relation module' described in 4.1, a corpus for evaluating relations between answers was designed, before dealing with the implementation of the module itself. For that effort, we took the references we gathered in the previous corpus and, for each pair of references, assigned one of the possible four types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in that case a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

- 'E' for a relation of equivalence between pairs;

- 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

- 'A' for alternative (complementary) answers;

- 'C' for contradictory answers;

- 'D' for cases of doubt between relation types.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADITORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
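A condensed sketch of such an annotation loop is given below; the file parsing is omitted and all names are illustrative, but the prompt options and the output line format follow the description above.

import java.util.*;

class RelationAnnotator {
    static final Map<String, String> LABELS = Map.of(
        "E", "EQUIVALENCE", "I1", "INCLUSION", "I2", "INCLUSION",
        "A", "COMPLEMENTARY ANSWERS", "C", "CONTRADITORY ANSWERS", "D", "IN DOUBT");

    public static void main(String[] args) throws Exception {
        Scanner in = new Scanner(System.in);
        // question id -> list of reference answers, loaded from the questions/references file
        Map<Integer, List<String>> references = loadReferences(args[0]);
        try (java.io.PrintWriter out = new java.io.PrintWriter(args[1])) {
            for (Map.Entry<Integer, List<String>> e : references.entrySet()) {
                List<String> refs = e.getValue();
                if (refs.size() < 2) continue; // single reference: advance to the next question
                for (int i = 0; i < refs.size(); i++)
                    for (int j = i + 1; j < refs.size(); j++) {
                        String judgment;
                        do {
                            System.out.printf("Q%d: '%s' vs '%s' [E/I1/I2/A/C/D/skip]%n",
                                              e.getKey(), refs.get(i), refs.get(j));
                            judgment = in.nextLine().trim();
                        } while (!judgment.equals("skip") && !LABELS.containsKey(judgment));
                        if (!judgment.equals("skip"))
                            out.printf("QUESTION %d %s - %s AND %s%n",
                                       e.getKey(), LABELS.get(judgment), refs.get(i), refs.get(j));
                    }
            }
        }
    }

    // parsing of the questions/references file omitted in this sketch
    static Map<Integer, List<String>> loadReferences(String path) { return new TreeMap<>(); }
}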

Figure 6: Execution of the application to create relations' corpora in the command line.

Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library): 'uk.ac.shef.wit.simmetrics.similaritymetrics'.

- Jaccard Index [11] - the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of characters the two sequences have in common by the total number of distinct characters occurring in either of them.

- Dice's Coefficient [6] - related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's Coefficient will basically give double the weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams for each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 x 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, with a union of seven distinct bigrams, we would have 5/7 ≈ 0.71.
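To make the bigram computation concrete, here is a small, self-contained sketch of character-bigram Dice and Jaccard measures reproducing the example above (illustrative code, not the SimMetrics implementation).

import java.util.HashSet;
import java.util.Set;

class BigramSimilarity {
    // set of character bigrams of a string
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    static double dice(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        return 2.0 * common.size() / (x.size() + y.size());
    }

    static double jaccard(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return (double) common.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));    // ~0.83
        System.out.println(jaccard("Kennedy", "Keneddy")); // ~0.71
    }
}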


- Jaro [13, 12] - this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 x (m/|s1| + m/|s2| + (m - t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

- Jaro-Winkler [45] - an extension of the Jaro metric. This distance is given by:

dw = dj + l x p x (1 - dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix compared to the original Jaro distance. In other words, it will give a much better score to strings such as 'Arnold' vs 'Arnie', or misspellings such as 'Martha' vs 'Marhta', than Jaro does.
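For instance, applying (17) and (18) to the 'Martha'/'Marhta' pair mentioned above (both strings have length 6, with m = 6 matching characters, t = 1 transposition and a common prefix 'Mar' of length l = 3):

Jaro(Martha, Marhta) = 1/3 x (6/6 + 6/6 + (6 - 1)/6) ≈ 0.944
dw = 0.944 + 3 x 0.1 x (1 - 0.944) ≈ 0.961

The shared prefix lifts the already high Jaro score even closer to 1.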

- Needleman-Wunsch [30] - a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 for Levenshtein):

D(i, j) = min{ D(i-1, j-1) + G,  D(i-1, j) + G,  D(i, j-1) + G }    (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

- Soundex [36] - with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

- Smith-Waterman [40] - a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.


Then, the LogSim metric by Cordeiro and his team, described in the Related Work in Section 2.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm - 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) - in order to evaluate how much the threshold affects the performance of each metric.
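To make this setup concrete, here is a minimal, self-contained sketch of threshold-based single-link clustering with a pluggable distance metric, in the spirit of the experiments above; the class and method names are illustrative and simplified with respect to the actual JustAsk code.

import java.util.*;

class SingleLinkClusterer {
    interface DistanceMetric { double distance(String a, String b); }

    static List<List<String>> cluster(List<String> answers, DistanceMetric metric, double threshold) {
        // start with one cluster per candidate answer
        List<List<String>> clusters = new ArrayList<>();
        for (String a : answers) clusters.add(new ArrayList<>(List.of(a)));

        while (true) {
            int bestI = -1, bestJ = -1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    // single-link: distance between clusters = minimum pairwise distance
                    double d = Double.MAX_VALUE;
                    for (String x : clusters.get(i))
                        for (String y : clusters.get(j))
                            d = Math.min(d, metric.distance(x, y));
                    if (d < best) { best = d; bestI = i; bestJ = j; }
                }
            if (bestI < 0 || best > threshold) break; // halt when the closest clusters exceed the threshold
            clusters.get(bestI).addAll(clusters.remove(bestJ));
        }
        return clusters;
    }
}

Any of the metrics above (Levenshtein, LogSim, or the new distance introduced in Section 5.2) can then be plugged in through the DistanceMetric interface, with the thresholds 0.17, 0.33 and 0.5 corresponding to the values tested in Table 8.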

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall, even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold    0.17                  0.33                  0.5
Levenshtein           P51.1 R87.0 A44.5     P43.7 R87.5 A38.5     P37.3 R84.5 A31.5
Overlap               P37.9 R87.0 A33.0     P33.0 R87.0 A33.0     P36.4 R86.5 A31.5
Jaccard index         P9.8  R56.0 A5.5      P9.9  R55.5 A5.5      P9.9  R55.5 A5.5
Dice Coefficient      P9.8  R56.0 A5.5      P9.9  R56.0 A5.5      P9.9  R55.5 A5.5
Jaro                  P11.0 R63.5 A7.0      P11.9 R63.0 A7.5      P9.8  R56.0 A5.5
Jaro-Winkler          P11.0 R63.5 A7.0      P11.9 R63.0 A7.5      P9.8  R56.0 A5.5
Needleman-Wunsch      P43.7 R87.0 A38.0     P43.7 R87.0 A38.0     P16.9 R59.0 A10.0
LogSim                P43.7 R87.0 A38.0     P43.7 R87.0 A38.0     P42.0 R87.0 A36.5
Soundex               P35.9 R83.5 A30.0     P35.9 R83.5 A30.0     P28.7 R82.0 A23.5
Smith-Waterman        P17.3 R63.5 A11.0     P13.4 R59.5 A8.0      P10.4 R57.5 A6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are obtained with Levenshtein at threshold 0.17.

5.2 New Distance Metric - 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim, also using a tokenizer to break the strings into a set of tokens, and then taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, while also adding an extra weight to strings that not only share at least a perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
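A compact sketch of the metric defined by equations (20)-(22) follows; the tokenization and edge-case handling are simplified, and the class name is merely illustrative.

import java.util.*;

class JFKDistance {
    static double distance(String a1, String a2) {
        List<String> t1 = new ArrayList<>(Arrays.asList(a1.trim().toLowerCase().split("\\s+")));
        List<String> t2 = new ArrayList<>(Arrays.asList(a2.trim().toLowerCase().split("\\s+")));
        int maxLength = Math.max(t1.size(), t2.size());

        // common terms: exact token matches, each token matched at most once
        int commonTerms = 0;
        for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
            if (t2.remove(it.next())) { it.remove(); commonTerms++; }
        }

        // among the unmatched tokens, count pairs sharing the same first letter ('john' / 'j.')
        int nonCommonTerms = t1.size() + t2.size();
        int firstLetterMatches = 0;
        for (String tok : t1) {
            if (tok.isEmpty()) continue;
            for (Iterator<String> it2 = t2.iterator(); it2.hasNext(); ) {
                String other = it2.next();
                if (!other.isEmpty() && other.charAt(0) == tok.charAt(0)) {
                    it2.remove();
                    firstLetterMatches++;
                    break;
                }
            }
        }

        double commonTermsScore = (double) commonTerms / maxLength;                              // eq. (21)
        double firstLetterMatchesScore =
            nonCommonTerms == 0 ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;    // eq. (22)
        return 1.0 - commonTermsScore - firstLetterMatchesScore;                                 // eq. (20)
    }

    public static void main(String[] args) {
        // 'J. F. Kennedy' vs 'John Fitzgerald Kennedy': 1 common term out of 3,
        // 2 first-letter matches among 4 unmatched tokens -> 1 - 1/3 - (2/4)/2 ≈ 0.417
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}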

A few flaws of this measure were identified later on, namely the scope of this measure within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures according to the type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows us the results after the implementation of this new distance metric.

                    200 questions            1210 questions
Before              P51.1 R87.0 A44.5        P28.2 R79.8 A22.5
After new metric    P51.7 R87.0 A45.0        P29.2 R79.8 A23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features - Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a combination of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis we are looking for too much), the results for every metric incorporated, for the three main categories Human, Location and Numeric, which can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), are shown in Table 10.

Metric / Category    Human    Location    Numeric
Levenshtein          50.00    53.85       4.84
Overlap              42.42    41.03       4.84
Jaccard index        7.58     0.00        0.00
Dice                 7.58     0.00        0.00
Jaro                 9.09     0.00        0.00
Jaro-Winkler         9.09     0.00        0.00
Needleman-Wunsch     42.42    46.15       4.84
LogSim               42.42    46.15       4.84
Soundex              42.42    46.15       3.23
Smith-Waterman       9.09     2.56        0.00
JFKDistance          54.55    53.85       4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best results per category in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since we find out that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

- Time-changing answers - some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).

- Numeric answers easily fail due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type NumericDate, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

- The filtering stage does not seem to work for certain questions - in some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.

- Answers in formats different from the expected - some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

- Difficulty tracing certain types of answer (category confusion) - for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

- Incomplete references - the most easily detectable error to be corrected is the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as this, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved

- We performed a state-of-the-art review on relations between answers and on paraphrase identification, to detect new possible ways of improving JustAsk's results in the Answer Extraction module.

- We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

- We developed new metrics to improve the clustering of answers.

- We developed new answer normalizers to relate two equal answers in different formats.

- We developed new corpora that we believe will be extremely useful in the future.

- We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we should try, here are a selected few:

- Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

- Use the rules as a distance metric of their own.

- Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help to somewhat improve these results.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805-810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25-32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16-23, 2003.
[4] Marie-Catherine De Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297-302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18-26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511-525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241-272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491-498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414-420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19-33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume (short papers), NAACL-Short '03, pages 46-48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143-150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211-240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265-283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707-710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845-848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1-7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112-120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471-486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371-391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775-780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39-41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311-318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18-26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence, pages 448-453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa - Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. US patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513-523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa - Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just.Ask - A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491-502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1-13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation - LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354-359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133-138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160-166, Sydney, Australia, 2005.

Appendix

Appendix A - Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

- Ai,j - the j-th token of candidate answer Ai

- Ai,j,k - the k-th char of the j-th token of candidate answer Ai

- lemma(X) - the lemma of X

- ishypernym(X, Y) - returns true if X is the hypernym of Y

- length(X) - the number of tokens of X

- contains(X, Y) - returns true if X contains Y

- containsic(X, Y) - returns true if X contains Y, case-insensitive

- equals(X, Y) - returns true if X is lexicographically equal to Y

- equalsic(X, Y) - returns true if X is lexicographically equal to Y, case-insensitive

- matches(X, RegExp) - returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if

- contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1). Example: 'M01 Y2001' includes 'D01 M01 Y2001'.

- contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2). Example: 'Y2001' includes 'D01 M01 Y2001'.



1.2 Questions of category Numeric

Given

- number = (\d+([0-9E]))

- lessthan = ((less than)|(almost)|(up to)|(nearly))

- morethan = ((more than)|(over)|(at least))

- approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

- max = 1.1 (or another value)

- min = 0.9 (or another value)

- numX - the numeric value in X

- numXmax = numX x max

- numXmin = numX x min

Ai is equivalent to Aj if

- (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)"). Example: '200 people' is equivalent to '200'.

- matchesic(Ai, "()bnumber( )") and matchesic(Aj, "()bnumber( )") and (numAimax > numAj) and (numAimin < numAj). Example: '210 people' is equivalent to '200 people'.

Ai includes Aj if

- (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )"). Example: 'approximately 200 people' includes '210'.

- (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAi = numAj). Example: 'approximately 200 people' includes '200'.

- (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi). Example: 'approximately 200 people' includes '210'.

- (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bmorethan number( )") and (numAi < numAj). Example: 'more than 200' includes '300'.

- (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()blessthan number( )") and (numAi > numAj). Example: 'less than 300' includes '200'.
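As an illustration, here is a minimal sketch of the numeric equivalence rule above (answers whose values differ by less than 10%, using max = 1.1 and min = 0.9 as in the specification); the parsing is simplified and the class name is hypothetical.

// Sketch of the numeric equivalence check ('210 people' vs '200 people').
class NumericRules {
    private static final double MAX = 1.1, MIN = 0.9;

    static boolean equivalent(String ai, String aj) {
        Double ni = firstNumber(ai), nj = firstNumber(aj);
        if (ni == null || nj == null) return false;
        return ni * MAX > nj && ni * MIN < nj;
    }

    private static Double firstNumber(String s) {
        java.util.regex.Matcher m = java.util.regex.Pattern.compile("\\d+(\\.\\d+)?").matcher(s);
        return m.find() ? Double.valueOf(m.group()) : null;
    }

    public static void main(String[] args) {
        System.out.println(equivalent("210 people", "200 people")); // true  (within 10%)
        System.out.println(equivalent("300 people", "200 people")); // false
    }
}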

1.3 Questions of category Human

Ai is equivalent to Aj if

- (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1). Example: 'John F. Kennedy' is equivalent to 'John Kennedy'.

- (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)"). Example: 'Mr. Kennedy' is equivalent to 'John Kennedy'.

- (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0). Example: 'John Fitzgerald Kennedy' is equivalent to 'John F. Kennedy'.

- (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1). Example: 'John F. Kennedy' is equivalent to 'F. Kennedy'.

1.3.1 Questions of category HumanIndividual

Ai is equivalent to Aj if

- (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0). Example: 'Mr. Kennedy' is equivalent to 'Kennedy'.

- containsic(Ai, Aj).

- levenshteinDistance(Ai, Aj) < 1.

1.4 Questions of category Location

Ai is equivalent to Aj if

- (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)"). Example: 'Republic of China' is equivalent to 'China'.

- (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)").

- (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0"). Example: 'in China' is equivalent to 'China'.

- (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0"). Example: 'in the lovely Republic of China' is equivalent to 'China'.

- (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0").

Ai includes Aj if

- (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ","). Example: 'Florida' includes 'Miami, Florida'.

- (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ","). Example: 'California' includes 'Los Angeles, California'.

- (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]"). Example: 'California' includes 'East California'.

1.5 Questions of category Entity

Ai is equivalent to Aj if

- (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj)).

Ai includes Aj if

- (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj). Example: 'animal' includes 'monkey'.


2.2.3 Canonicalization as complement to lexical features

Zhang and Patrick [47] offer another sentence-level solution: converting the pairs of sentences into canonical forms at the lexical and syntactic level - 'generic' versions of the original text - and then applying lexical matching features such as Edit Distance and N-gram overlap. This canonicalization involves particularly number entities (dates, times, monetary values, percentages), passive to active voice transformation (through dependency trees, using Minipar [21]) and future tense sentences (mapping structures such as 'plans to' or 'expects to do' into the word 'will'). They also claimed that the MSR Paraphrase Corpus is limited, by counting the number of nominalizations (i.e., producing a noun from another part of speech, like 'combination' from 'combine') through a semi-automatic method, and concluding that the MSR has many nominalizations and not enough variety - which discards a certain group of paraphrases: the ones that use (many) different words to express the same meaning. This limitation can also explain why this method performs as well as a baseline approach. Moreover, the baseline approach with simple lexical matching features led to better results than other, more 'sophisticated' methodologies, which again should support the evidence of the lack of lexical variety in the corpus used. JustAsk already has a normalizer for answers of type NumericCount and NumericDate, so the kind of equivalences involving Numeric Entities that can be considered answers for certain types of questions are already covered. Passive to active voice and future tense transformations can, however, matter to more particular answers. Considering, for instance, the question 'How did President John F. Kennedy die?', the answers 'He was shot' and 'Someone shot him' are paraphrases in the passive and active voice, respectively.

2.2.4 Use of context to extract similar phrases

While dealing with the task of sentence alignment in monolingual corpora, in the field of text-to-text generation, Barzilay and Elhadad [2] raised a very pertinent question. Two sentences sharing most of their words are likely to be paraphrases. But these are the so-called 'easy' paraphrases. There are many paraphrases that convey the same information but share little lexical resemblance, and for those the usual lexical similarity measures mentioned above should not be enough by themselves. They proposed a two-phase process: first using learning rules to match pairs of paragraphs based on similar topic structure (with two paragraphs being grouped because they convey the same type of information, such as weather reports, for instance), and then refining the match through local alignment. Not only were the 'easy' paraphrases identified, but also some of the tricky ones, by combining lexical similarity (a simple cosine measure) with dynamic programming. Considering two sentences S1 and S2, their local similarity is given by:

sim(S1, S2) = cos(S1, S2) - mismatch_penalty    (4)


with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given through the following dynamic programming formula - a way of breaking down a complex problem into a series of simpler subproblems:

s(S1, S2) = max {
    s(S1, S2-1) - skip_penalty,
    s(S1-1, S2) - skip_penalty,
    s(S1-1, S2-1) + sim(S1, S2),
    s(S1-1, S2-2) + sim(S1, S2) + sim(S1, S2-1),
    s(S1-2, S2-1) + sim(S1, S2) + sim(S1-1, S2),
    s(S1-2, S2-2) + sim(S1, S2-1) + sim(S1-1, S2)
}    (5)

with the skip penalty ensuring that only sentence pairs with high similarity get aligned.

Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phased framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then, the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, in order to determine if we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a matching between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus, they used the British National Corpus (http://www.natcorp.ox.ac.uk), a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

sim(T1, T2) = 1/2 x ( Σ_{w∈T1} maxSim(w, T2) x idf(w) / Σ_{w∈T1} idf(w)  +  Σ_{w∈T2} maxSim(w, T1) x idf(w) / Σ_{w∈T2} idf(w) )    (6)

For each word w in the segment T1, the goal is to identify the word in T2 that matches it the most, given by maxSim(w, T2), then apply the specificity via IDF, sum these two components and normalize with the length of the text segment in question. The same process is applied to T2 and, in the end, we generate an average between T1 to T2 and vice versa. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis - LSA [17]) and six knowledge-based (Leacock and Chodorow - lch [18], Lesk [1], Wu & Palmer - wup [46], Resnik [34], Lin [22] and Jiang & Conrath - jcn [14]). Below we offer a brief description of each of these metrics:

- Pointwise Mutual Information (PMI-IR) - first suggested by Turney [42] as a measure of semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

- Latent Semantic Analysis (LSA) - uses a technique from linear algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage.

- Lesk - measures the number of overlaps (shared words) in the definitions (glosses) of two concepts, and of concepts directly related through relations of hypernymy and meronymy.

- Lch - computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

- Wup - computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

- Resnik - takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have high Information Content.

- Lin - takes the Resnik measure and normalizes it using also the Information Content of the two nodes received as input.

- Jcn - uses the same information as Lin, but combined in a different way.

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa - in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs 'assassination') or an adjective and a noun ('happy' vs 'happiness'), for instance. They measure similarity using variations of features such as:

- Sentence Expansion - expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion.

- Latent Semantic Analysis (LSA) - by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domains resource.

- Cosine measure - by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning. They therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2, we therefore have two vectors a and b, and the similarity is calculated using the following formula:

sim(a, b) = a W b / (|a| |b|)    (7)


with W being a semantic similarity matrix containing the information about the similarity level of each word pair - in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea, Qiu or Zhang, and five also with a better precision rating than those three methods.

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence, a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve on Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here represents what we found to have the most potential to be successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques, we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two, with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID (http://www.inesc-id.pt) that combines rule-based components with machine learning-based ones. Over this chapter, we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about the question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask of this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalizing, clustering and filtering the candidate answers, to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories LocationCountry and LocationCity) to extract plausible answers from the relevant passages. Currently, only answers of type NumericCount and NumericDate are normalized. Normalization is important to diminish the variation among answers, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will be transformed into 'D21 M3 Y1961'.

Similar instances are then grouped into a set of clusters and, for that purpose, distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 - |X ∩ Y| / min(|X|, |Y|)    (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned in Section 3.1.3, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own and then, at each step, the two closest clusters are merged, if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative (the cluster's longest answer), and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.

3.2 Experimental Setup

In this section we describe the evaluation measures and the corpora used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved   (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus   (11)

F-measure (F1):

    F = 2 · Precision · Recall / (Precision + Recall)   (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus   (13)

Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results over a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by

    MRR = (1/N) · Σ_{i=1}^{N} 1/rank_i   (14)
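A minimal sketch of the MRR computation follows, assuming that rank_i = 0 encodes 'no correct answer returned' for query i (an assumption of this example, not a convention stated above).

    // Minimal MRR sketch: ranks[i] is the rank of the first correct answer for
    // query i, or 0 when no correct answer was returned (contributes nothing).
    public class MeanReciprocalRank {

        public static double mrr(int[] ranks) {
            double sum = 0.0;
            for (int r : ranks)
                if (r > 0) sum += 1.0 / r;
            return sum / ranks.length;
        }

        public static void main(String[] args) {
            System.out.println(mrr(new int[]{1, 3, 0, 2}));  // (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.46
        }
    }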

3.2.2 Multisix corpus

To know for sure that a system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found ourselves.

We began by using the Multisix corpus [24], a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also provides the correct answer for each question, searched in the 1994 editions of the Los Angeles Times. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or were even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered from the Multieight corpus [25] and used.

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhausting in itself: the corpus file with all the results produced by the Google/Yahoo queries is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1. Question=What is the capital of Somalia?

URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding one (or several) references. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file with a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one at a time, removing 'unnecessary' content like 'URL' or 'Title' and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word from the content that has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, compared to the completely manual approach we followed before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of enabling the construction of new corpora in the future.
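The core check in this annotation tool can be pictured as in the sketch below: a typed candidate reference is only accepted if it occurs in the snippet's 'Content' field, and is then stored for that question. The class and method names are hypothetical; this is not the tool's actual code.

    import java.util.*;

    // Sketch of the annotation tool's core check: accept a typed reference only
    // when it occurs in the snippet content, then store it for that question.
    public class ReferenceCollector {

        private final Map<Integer, List<String>> references = new HashMap<>();

        public boolean addIfSupported(int questionId, String snippetContent, String candidate) {
            if (candidate == null || candidate.trim().isEmpty()) return false;
            if (!snippetContent.toLowerCase().contains(candidate.toLowerCase())) return false;
            references.computeIfAbsent(questionId, k -> new ArrayList<>()).add(candidate);
            return true;
        }

        public static void main(String[] args) {
            ReferenceCollector rc = new ReferenceCollector();
            String content = "Mogadishu is the largest city in Somalia and the nation's capital.";
            System.out.println(rc.addIfSupported(1, content, "Mogadishu")); // true  -> stored
            System.out.println(rc.addIfSupported(1, content, "Nairobi"));   // false -> rejected
        }
    }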

Figure 3: Execution of the application to create answer corpora in the command line.

3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we extracted those for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly and which 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times retrieved a correct answer.

After experimenting with the 1134 questions, we noticed that many of them are actually interconnected (some even lose their original referent after this filtering step) and therefore lose precious entity information by using pronouns, for example, which could partly explain the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost referent: we simply do not know who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear referent, because the question itself references the previous one, and so forth?

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But much less obvious cases were still present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?

In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historical context (which Sundance Festival? Is it the 21st? The 25th?). Another blatant example is the use of more complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these TREC questions carries topic information that is often not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after fixing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.

New York Times corpus

Twenty years of New York Times editions were indexed using Lucene as the search tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, and in as many variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph or sentence, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not whole documents but only the paragraphs that contain the keywords sent in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see in the next section.
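A possible shape for this paragraph-level indexing is sketched below against a recent Lucene release; the Lucene version (and therefore the exact API) used in the original experiments may differ, and the field name, index path and paragraph splitter are assumptions of this example.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    // Sketch: index each paragraph of an article as its own Lucene document,
    // so retrieval returns a paragraph instead of the whole article.
    public class ParagraphIndexer {

        static void indexArticle(IndexWriter writer, String articleText) throws Exception {
            for (String paragraph : articleText.split("\\n\\s*\\n")) { // blank-line separated
                if (paragraph.trim().isEmpty()) continue;
                Document doc = new Document();
                doc.add(new TextField("contents", paragraph, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        public static void main(String[] args) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("nyt-paragraph-index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                indexArticle(writer, "First paragraph of the article.\n\nSecond paragraph.");
            }
        }
    }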

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision of the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175, 50.3% of them correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.

Questions: 200    Correct: 88    Wrong: 87    No Answer: 25

Accuracy: 44.0%    Precision: 50.3%    Recall: 87.5%    MRR: 0.37    Precision@1: 35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to compute the distance between candidate answers. For different amounts of passages retrieved from Google (8, 32 or 64), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer is extracted, while the success of the AS stage is given by the number of questions for which one correct final answer is selected from the list of candidates. The tables also show the distribution of the number of questions by the amount of extracted answers.

Extracted        Google – 8 passages                 Google – 32 passages
answers       Questions  CAE success  AS success     Questions  CAE success  AS success
0                 21          0           –              19          0           –
1 to 10           73         62          53              16         11          11
11 to 20          32         29          22              13          9           7
21 to 40          11         10           8              61         57          39
41 to 60           1          1           1              30         25          17
61 to 100          0          –           –              18         14          11
> 100              0          –           –               5          4           3
All              138        102          84             162        120          88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any clear tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

Extracted       Google – 64 passages
answers       Questions  CAE success  AS success
0                 17          0            0
1 to 10           18          9            9
11 to 20           2          2            2
21 to 40          16         12            8
41 to 60          42         38           25
61 to 100         38         32           18
> 100             35         30           17
All              168        123           79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.

               Google – 8 passages     Google – 32 passages     Google – 64 passages
       Total     Q    CAE    AS           Q    CAE    AS            Q    CAE    AS
abb       4      3     1      1           3     1      1            4     1      1
des       9      6     4      4           6     4      4            6     4      4
ent      21     15     1      1          16     1      0           17     1      0
hum      64     46    42     34          55    48     36           56    48     31
loc      39     27    24     19          34    27     22           35    29     21
num      63     41    30     25          48    35     25           50    40     22
All     200    138   102     84         162   120     88          168   123     79

Table 4: Answer extraction intrinsic evaluation, for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves the best results with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted answers, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also shows the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results. The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see in Table 5 below.

             1134 questions    232 questions    1210 questions
Precision        26.7%             28.2%             27.4%
Recall           77.2%             67.2%             76.9%
Accuracy         20.6%             19.0%             21.1%

Table 5: Baseline results for the three sets of questions created using the TREC questions and the New York Times corpus.

Later, we made further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages      20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, they also take more than three times as long as running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these options is viable, with 30 passages offering the best middle ground.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, by failing to group equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers is revealing:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

– 'Namibia has built a water canal measuring about 300 km long... Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the query keywords than a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

– 'Queen Elizabeth' vs. 'Elizabeth II';

– 'Mt Kilimanjaro' vs. 'Mount Kilimanjaro';

– 'River Glomma' vs. 'Glomma';

– 'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena';

– 'Calvin Broadus' vs. 'Broadus'.

Solution: we might write a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that computes similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Is there a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than a single common link, as we have here ('Kennedy'), out of three words in each answer. So we have LogSim approximately equal to 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.
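The proposed token-level check can be sketched as below: require at least one exact token match and let the remaining tokens match by their initials, in order. This is an illustrative reading of the idea above, not JustAsk's implementation, and the class and method names are ours.

    // Sketch of the proposed token-level check for person names: at least one
    // exact token match, with remaining tokens matching on their first letter
    // (so 'J F Kennedy' aligns with 'John Fitzgerald Kennedy').
    public class NameTokenMatcher {

        public static boolean sameEntity(String a, String b) {
            String[] ta = a.split("\\s+");
            String[] tb = b.split("\\s+");
            if (ta.length != tb.length) return false;        // simple case: same number of tokens
            boolean exactMatch = false;
            for (int i = 0; i < ta.length; i++) {
                String x = ta[i].toLowerCase().replace(".", "");
                String y = tb[i].toLowerCase().replace(".", "");
                if (x.equals(y)) { exactMatch = true; continue; }
                if (x.charAt(0) != y.charAt(0)) return false; // initials must agree, in order
            }
            return exactMatch;                                // at least one perfect token match
        }

        public static void main(String[] args) {
            System.out.println(sameEntity("J F Kennedy", "John Fitzgerald Kennedy")); // true
            System.out.println(sameEntity("John Kennedy", "Jane Kruger"));            // false: no exact match
        }
    }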

3. Places/dates/numbers that differ in specificity – some candidates in the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity these answers share a relation of inclusion. Examples:

– 'Mt Kenya' vs. 'Mount Kenya National Park';

– 'Bavaria' vs. 'Oberbayern, Upper Bavaria';

– '2009' vs. 'April 3, 2009'.

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (and being the more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the line for inclusive answers, and which answer should we consider as representative if we group these answers into a single cluster?

Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were applied to the TREC corpus in that work, now within JustAsk, hoping for similar improvements.

4. Language issues – many entities are referred to by completely different names: we can find the name in the original language of the entity's location or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

– 'Kotashaan' vs. 'Khotashan';

– 'Grozny' vs. 'GROZNY'.

Solution: the fix should simply be to normalize all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to also cover acronyms.

In this section we have looked at several points of improvement. Next, we will describe how we addressed each of them.

4 Relating Answers

In this section we will try to solve the problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations between answers described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships, valuing relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

Once this process is automated and the answering module is extended, we can study how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do in the first part of this thesis – a module in JustAsk that detects relations between answers and uses them for a (potentially) better clustering, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

Each answer now possesses the set of relations it shares with other answers – namely the type of relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes, and another for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not change at all, for any of the corpora.
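The scoring idea can be sketched as below: each relation detected for a candidate adds a fixed weight to its score (1.0 for frequency and equivalence, 0.5 for each inclusion, following the first experiment above). The enum and class names are illustrative, not JustAsk's actual Scorer interface.

    import java.util.*;

    // Illustrative sketch of relation-based score boosting (names are ours).
    public class RelationScorer {

        enum Relation { FREQUENCY, EQUIVALENCE, INCLUDES, INCLUDED_BY }

        static double weight(Relation r) {
            switch (r) {
                case FREQUENCY:
                case EQUIVALENCE:
                    return 1.0;
                default:           // both directions of inclusion
                    return 0.5;
            }
        }

        static double score(List<Relation> relationsOfCandidate) {
            double s = 0.0;
            for (Relation r : relationsOfCandidate) s += weight(r);
            return s;
        }

        public static void main(String[] args) {
            // e.g. a candidate with one equivalent duplicate and one answer it includes
            System.out.println(score(Arrays.asList(Relation.EQUIVALENCE, Relation.INCLUDES))); // 1.5
        }
    }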

Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: the Baseline Scorer, which counts frequency, and the Rules Scorer, which counts relations of equivalence and inclusion.

4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created to deal with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and degrees Celsius to Fahrenheit, respectively.
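A unit-conversion normalizer of this kind could look like the sketch below, which maps distance answers to a single canonical unit before comparison. The conversion factors are standard; the class and method names are assumptions of this example, not the actual normalizer class.

    // Sketch of a distance normalizer: convert values to a canonical unit (km)
    // so that equivalent answers expressed in different units can be compared.
    public class DistanceNormalizer {

        public static double toKilometres(double value, String unit) {
            switch (unit.toLowerCase()) {
                case "km":
                case "kilometres":
                case "kilometers": return value;
                case "miles":      return value * 1.609344;
                case "feet":       return value * 0.0003048;
                default: throw new IllegalArgumentException("unknown unit: " + unit);
            }
        }

        public static void main(String[] args) {
            // '1000 miles' and '1609 km' now normalize to (almost) the same value
            System.out.println(toKilometres(1000, "miles"));  // ~1609.3
            System.out.println(toKilometres(1609, "km"));     // 1609.0
        }
    }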

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, the extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a reminder that we must never underestimate the richness of natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, leaving room for further extensions.

4.2.2 Proximity Measure

For problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing in the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (which we defined as 20 characters), we boost that answer's score (by 5 points).
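The heuristic can be sketched as follows. The 20-character threshold and the 5-point boost come from the text above; the character-offset computation and all names are assumptions of this illustrative example.

    // Sketch of the proximity boost: average character distance between the
    // answer position and each query keyword found in the passage; boost the
    // score when that average is under the threshold.
    public class ProximityBooster {

        static double averageDistance(String passage, int answerPos, String[] keywords) {
            double total = 0;
            int found = 0;
            for (String kw : keywords) {
                int pos = passage.toLowerCase().indexOf(kw.toLowerCase());
                if (pos >= 0) { total += Math.abs(pos - answerPos); found++; }
            }
            return found == 0 ? Double.MAX_VALUE : total / found;
        }

        public static double boostedScore(double score, String passage, int answerPos, String[] keywords) {
            return averageDistance(passage, answerPos, keywords) < 20 ? score + 5 : score;
        }

        public static void main(String[] args) {
            String passage = "The capital of Somalia is Mogadishu.";
            int answerPos = passage.indexOf("Mogadishu");
            // average distance to 'capital' and 'Somalia' is 16.5 < 20, so 1.0 becomes 6.0
            System.out.println(boostedScore(1.0, passage, answerPos, new String[]{"capital", "Somalia"}));
        }
    }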

4.3 Results

                       200 questions                     1210 questions
Before features        P 50.6%   R 87.0%   A 44.0%       P 28.0%   R 79.5%   A 22.2%
After features         P 51.1%   R 87.0%   A 44.5%       P 28.2%   R 79.8%   A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered for the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.

As with the 'answers corpus', in order to create this corpus of relationships between answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' for a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for doubt about the type of relation.

If the input fails to match any of these options, the script asks the user to repeat the judgement. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.

In the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.

5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to run new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity library – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

    J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across the two sequences.

– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of elements that the two sequences share. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

    Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient may be calculated over the sets of bigrams of the two strings [15]. These sets are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and their intersection has five: {Ke, en, ne, ed, dy}. The result would therefore be (2 · 5)/(6 + 6) = 10/12 ≈ 0.83; with Jaccard (intersection over union) we would get 5/7 ≈ 0.71. A small sketch of this bigram computation is given at the end of this list.

– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

    Jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

    dw = dj + l · p · (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance; in other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 used by Levenshtein):

    D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }   (19)

The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even to match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better for answers with common affixes, such as 'acting' and 'enact'.
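Below is the bigram-based Dice computation referred to in the Dice's coefficient entry above, as a minimal illustrative sketch (not the SimMetrics library code).

    import java.util.*;

    // Sketch of Dice's coefficient over character bigrams, as in the
    // 'Kennedy' / 'Keneddy' example above.
    public class DiceBigrams {

        static Set<String> bigrams(String s) {
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + 1 < s.length(); i++) grams.add(s.substring(i, i + 2));
            return grams;
        }

        public static double dice(String a, String b) {
            Set<String> x = bigrams(a.toLowerCase());
            Set<String> y = bigrams(b.toLowerCase());
            Set<String> shared = new HashSet<>(x);
            shared.retainAll(y);
            return 2.0 * shared.size() / (x.size() + y.size());
        }

        public static void main(String[] args) {
            System.out.println(dice("Kennedy", "Keneddy"));  // 2*5 / (6+6) ≈ 0.83
        }
    }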


Then, the LogSim metric by Cordeiro and his team, described in the Related Work section, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best system results at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.

Metric / Threshold          0.17                      0.33                      0.5
Levenshtein          P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap              P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index        P  9.8  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5    P  9.9  R 55.5  A  5.5
Dice coefficient     P  9.8  R 56.0  A  5.5    P  9.9  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5
Jaro                 P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Jaro-Winkler         P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Needleman-Wunsch     P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim               P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex              P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman       P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A  8.0    P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results, in %, for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are obtained with the Levenshtein distance at threshold 0.17.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and adds an extra weight for strings that not only share at least one perfect match but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers, and Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more answers of this type. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

    distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the first-letter matches of the remaining tokens are given by:

    commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

    firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the example above.
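A minimal sketch of one possible reading of this definition follows. The exact counting of nonCommonTerms and firstLetterMatches is not fully pinned down above, so this version aligns the leftover tokens in order and compares their initials; everything here is illustrative, not JustAsk's actual implementation.

    import java.util.*;

    // Sketch of the 'JFKDistance' idea: exact token matches count fully, and
    // leftover tokens that agree on their first letter count half. Assumes the
    // answers were already normalized (case, abbreviations).
    public class JFKDistance {

        public static double distance(String a1, String a2) {
            List<String> t1 = new ArrayList<>(Arrays.asList(a1.toLowerCase().split("\\s+")));
            List<String> t2 = new ArrayList<>(Arrays.asList(a2.toLowerCase().split("\\s+")));
            int maxLen = Math.max(t1.size(), t2.size());

            int commonTerms = 0;
            for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
                if (t2.remove(it.next())) { it.remove(); commonTerms++; }  // perfect token match
            }
            double commonTermsScore = (double) commonTerms / maxLen;

            // leftover tokens: half weight for matches on the first letter, in order
            int pairs = Math.max(t1.size(), t2.size());
            int firstLetterMatches = 0;
            for (int i = 0; i < Math.min(t1.size(), t2.size()); i++)
                if (t1.get(i).charAt(0) == t2.get(i).charAt(0)) firstLetterMatches++;
            double firstLetterScore = pairs == 0 ? 0.0 : (firstLetterMatches / (double) pairs) / 2.0;

            return 1.0 - commonTermsScore - firstLetterScore;
        }

        public static void main(String[] args) {
            // 'Kennedy' matches exactly; 'J'/'John' and 'F'/'Fitzgerald' match on initials:
            // 1 - 1/3 - (2/2)/2 ≈ 0.17, low enough to cluster the two answers.
            System.out.println(distance("J F Kennedy", "John Fitzgerald Kennedy"));
        }
    }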

A few flaws of this measure were identified later on, namely regarding its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.

5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                        200 questions                     1210 questions
Before new metric       P 51.1%   R 87.0%   A 44.5%       P 28.2%   R 79.8%   A 22.5%
After new metric        P 51.7%   R 87.0%   A 45.0%       P 29.2%   R 79.8%   A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a combination of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric that offers the best results, we would have something like:

if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
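If this per-category selection had been adopted, it could be realized as a simple lookup such as the sketch below; the category-to-metric mapping shown is purely illustrative (and, as discussed next, the analysis ended up favouring a single metric for all categories).

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of per-category metric selection.
    public class MetricSelector {

        enum Metric { LEVENSHTEIN, LOGSIM, JFK_DISTANCE }

        private static final Map<String, Metric> BY_CATEGORY = new HashMap<>();
        static {
            BY_CATEGORY.put("HUMAN", Metric.LEVENSHTEIN);
            BY_CATEGORY.put("LOCATION", Metric.LOGSIM);
            // every other category falls back to the default below
        }

        public static Metric forCategory(String category) {
            return BY_CATEGORY.getOrDefault(category.toUpperCase(), Metric.JFK_DISTANCE);
        }

        public static void main(String[] args) {
            System.out.println(forCategory("Location"));  // LOGSIM
            System.out.println(forCategory("Numeric"));   // JFK_DISTANCE (fallback)
        }
    }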

For that effort, we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there were a few answers not labelled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for too much), Table 10 shows the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input).

Metric / Category      Human    Location   Numeric
Levenshtein            50.00     53.85      4.84
Overlap                42.42     41.03      4.84
Jaccard index           7.58      0.00      0.00
Dice coefficient        7.58      0.00      0.00
Jaro                    9.09      0.00      0.00
Jaro-Winkler            9.09      0.00      0.00
Needleman-Wunsch       42.42     46.15      4.84
LogSim                 42.42     46.15      4.84
Soundex                42.42     46.15      3.23
Smith-Waterman          9.09      2.56      0.00
JFKDistance            54.55     53.85      4.84

Table 10: Accuracy results, in %, for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. The best accuracy is achieved by the new JFKDistance, which equals or surpasses every other metric in all three categories.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking a best metric for each category, since we find that the newly implemented distance equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than the then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).

– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules that remove these variations from the final answers.

– Answers in formats different from the expected one – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What should be done in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error detected lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After filling in gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answer Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state-of-the-art survey on relations between answers and on paraphrase identification, to detect possible new ways of improving JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006

[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989

53

[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006

[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini Simone Romagnoli Alessandro Vallin Jesus HerreraAnselmo Penas Vıctor Peinado Felisa Verdejo and Maarten de Rijke Themultiple language question answering track at clef 2003 In CLEF volume3237 of Lecture Notes in Computer Science pages 471ndash486 Springer 2003

[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor ErbachAnselmo Pe nas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov andRichard F E Sutcliffe Overview of the clef 2004 multilingual questionanswering track In CLEF volume 3491 of Lecture Notes in ComputerScience pages 371ndash391 Springer 2004

54

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1,261,167 (1918), 1,435,663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur, and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.


[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is a hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: "M01 Y2001" includes "D01 M01 Y2001"

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: "Y2001" includes "D01 M01 Y2001"



1.2 Questions of category Numeric

Given:

• number = (\d+([0-9,.E])*)

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX * max

• numXmin = numX * min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: "200 people" is equivalent to "200"

• matchesic(Ai, "(.*)\bnumber( .*)*") and matchesic(Aj, "(.*)\bnumber( .*)*") and (numAimax > numAj) and (numAimin < numAj)
  – Example: "210 people" is equivalent to "200 people"

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)*")
  – Example: "approximately 200 people" includes "210"

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)*") and (numAi = numAj)
  – Example: "approximately 200 people" includes "200"

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)*") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: "approximately 200 people" includes "210"

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number( .*)*") and (numAi < numAj)
  – Example: "more than 200" includes "300"

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number( .*)*") and (numAi > numAj)
  – Example: "less than 300" includes "200"

A short sketch applying these numeric checks is given below.
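To make the tolerance-based equivalence and the 'more than'/'less than' inclusion tests above concrete, here is a minimal Python sketch. The 1.1/0.9 factors mirror the definitions above; the helper names and simplified regular expressions are illustrative assumptions, not JustAsk's actual code.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)?")
MORETHAN = re.compile(r"\b(more than|over|at least)\b", re.I)
LESSTHAN = re.compile(r"\b(less than|almost|up to|nearly)\b", re.I)
APPROX = re.compile(r"\b(approx[a-z]*|around|roughly|about|some)\b", re.I)

MAX_FACTOR, MIN_FACTOR = 1.1, 0.9   # tolerance used for 'equivalent' numeric answers

def num(answer):
    """Return the first numeric value found in the answer, or None."""
    match = NUMBER.search(answer.replace(",", ""))
    return float(match.group()) if match else None

def numeric_equivalent(ai, aj):
    """'210 people' vs '200 people': equivalent if within +/-10% of each other."""
    ni, nj = num(ai), num(aj)
    if ni is None or nj is None:
        return False
    return ni * MIN_FACTOR < nj < ni * MAX_FACTOR

def numeric_includes(ai, aj):
    """'more than 200' includes '300'; 'less than 300' includes '200'."""
    ni, nj = num(ai), num(aj)
    if ni is None or nj is None:
        return False
    if MORETHAN.search(ai):
        return ni < nj
    if LESSTHAN.search(ai):
        return ni > nj
    if APPROX.search(ai):
        return ni == nj or nj * MIN_FACTOR < ni < nj * MAX_FACTOR
    return False

if __name__ == "__main__":
    print(numeric_equivalent("210 people", "200 people"))  # True
    print(numeric_includes("more than 200", "300"))         # True
    print(numeric_includes("less than 300", "200"))         # True
```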

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: "John F Kennedy" is equivalent to "John Kennedy"

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: "Mr Kennedy" is equivalent to "John Kennedy"

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: "John Fitzgerald Kennedy" is equivalent to "John F Kennedy"

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: "John F Kennedy" is equivalent to "F Kennedy"

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: "Mr Kennedy" is equivalent to "Kennedy"

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: "Republic of China" is equivalent to "China"


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: "in China" is equivalent to "China"

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: "in the lovely Republic of China" is equivalent to "China"

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: "Florida" includes "Miami, Florida"

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: "California" includes "Los Angeles, California"

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  – Example: "California" includes "East California"

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: "animal" includes "monkey"

A sketch of such a hypernym test over WordNet follows.
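The ishypernym(Ai, Aj) test can be approximated over WordNet. The sketch below, which assumes NLTK and its WordNet corpus are installed, simply checks whether some synset of Ai appears among the hypernym ancestors of some synset of Aj (so 'animal' includes 'monkey'); it is an illustration, not JustAsk's implementation.

```python
# Requires: pip install nltk ; python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

def is_hypernym(ai, aj):
    """True if some sense of `ai` is an ancestor (hypernym) of some sense of `aj`."""
    ai_synsets = set(wn.synsets(ai.replace(" ", "_")))
    for synset in wn.synsets(aj.replace(" ", "_")):
        # hypernym_paths() lists every chain from the root down to the synset
        for path in synset.hypernym_paths():
            if ai_synsets.intersection(path[:-1]):  # exclude the synset itself
                return True
    return False

if __name__ == "__main__":
    print(is_hypernym("animal", "monkey"))  # True: 'animal' includes 'monkey'
    print(is_hypernym("monkey", "animal"))  # False
```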


with the mismatch penalty being a decimal value used to penalize pairs with a very low similarity measure. The weight of the optimal alignment between two pairs is given by the following dynamic-programming formula – dynamic programming being a way of breaking down a complex problem into a series of simpler subproblems:

    s(S1, S2) = max {
        s(S1, S2-1) - skip_penalty,
        s(S1-1, S2) - skip_penalty,
        s(S1-1, S2-1) + sim(S1, S2),
        s(S1-1, S2-2) + sim(S1, S2) + sim(S1, S2-1),
        s(S1-2, S2-1) + sim(S1, S2) + sim(S1-1, S2),
        s(S1-2, S2-2) + sim(S1, S2-1) + sim(S1-1, S2)
    }        (5)

with the skip penalty ensuring that only sentence pairs with a sufficiently high similarity end up aligned.
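Equation (5) can be read as a recurrence filled bottom-up over the two documents. The sketch below is only an illustration of that recurrence, with the skip penalty and sim() left as placeholders (here a toy word-overlap similarity); it is not the authors' implementation.

```python
def overlap_sim(s1, s2):
    """Toy sentence similarity: word-overlap ratio (placeholder for any 'weak' metric)."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / max(1, min(len(w1), len(w2)))

def align_score(doc1, doc2, skip_penalty=0.3, sim=overlap_sim):
    """Fill s[i][j], the best alignment score of doc1[:i] and doc2[:j] (equation 5)."""
    n, m = len(doc1), len(doc2)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i][j - 1] - skip_penalty,          # skip a sentence of doc2
                       s[i - 1][j] - skip_penalty)          # skip a sentence of doc1
            best = max(best, s[i - 1][j - 1] + sim(doc1[i - 1], doc2[j - 1]))
            if j >= 2:   # one sentence of doc1 aligned with two of doc2
                best = max(best, s[i - 1][j - 2] + sim(doc1[i - 1], doc2[j - 1])
                                                 + sim(doc1[i - 1], doc2[j - 2]))
            if i >= 2:   # two sentences of doc1 aligned with one of doc2
                best = max(best, s[i - 2][j - 1] + sim(doc1[i - 1], doc2[j - 1])
                                                 + sim(doc1[i - 2], doc2[j - 1]))
            if i >= 2 and j >= 2:   # crossed two-by-two alignment
                best = max(best, s[i - 2][j - 2] + sim(doc1[i - 1], doc2[j - 2])
                                                 + sim(doc1[i - 2], doc2[j - 1]))
            s[i][j] = best
    return s[n][m]

if __name__ == "__main__":
    d1 = ["the cat sat on the mat", "dogs bark loudly"]
    d2 = ["a cat was sitting on a mat", "dogs often bark"]
    print(round(align_score(d1, d2), 2))
```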

Results were interesting: a 'weak' similarity measure combined with contextual information managed to outperform more sophisticated methods such as SimFinder, using two collections from Encyclopedia Britannica and Britannica Elementary. Context matters.

This whole idea should be interesting to explore in a future line of investigation within JustAsk, where we study the answer's context, i.e., the segments from which the answer has been extracted. This work has also been included because it warns us about the several different types of paraphrases/equivalences that can exist in a corpus (more lexical-based, or using completely different words).

2.2.5 Use of dissimilarities and semantics to provide better similarity measures

In 2006, Qiu and colleagues [33] proposed a two-phased framework using dissimilarity between sentences. In the first phase, similarities between tuple pairs (more than a single word) are computed using a similarity metric that takes into account both syntactic structure and tuple content (derived from a thesaurus). Then the remaining tuples that found no relevant match (unpaired tuples), which represent the extra bits of information that one sentence contains and the other does not, are parsed through a dissimilarity classifier, which determines whether we can deem the two sentences paraphrases.

Mihalcea and her team [27] decided that a semantic approach was essential to identify paraphrases with better accuracy. By measuring the semantic similarity between two segments, combining corpus-based and knowledge-based measures, they were able to get much better results than a standard vector-based model using cosine similarity.

They took into account not only the similarity but also the 'specificity' between two words, giving more weight to a match between two specific words (like two tonalities of the same color, for instance) than between two generic concepts. This specificity is determined using the inverse document frequency (IDF) [37], given


by the total number of documents in the corpus divided by the total number of documents including the word. For the corpus they used the British National Corpus (http://www.natcorp.ox.ac.uk), a collection of 100 million words that gathers both spoken-language and written samples.

The similarity of two text segments T1 and T2, sim(T1, T2), is given by the following formula:

    sim(T1, T2) = 1/2 × ( Σ_{w∈T1} (maxSim(w, T2) × idf(w)) / Σ_{w∈T1} idf(w)
                        + Σ_{w∈T2} (maxSim(w, T1) × idf(w)) / Σ_{w∈T2} idf(w) )        (6)

For each word w in the segment T1, the goal is to identify the word that matches it best in T2, given by maxSim(w, T2), weight that match by the word's specificity via IDF, sum these weighted matches and normalize the sum by the total IDF weight of the segment in question. The same process is applied starting from T2 and, in the end, we take the average of the two directions. They offer eight different similarity measures to calculate maxSim: two corpus-based (Pointwise Mutual Information [42] and Latent Semantic Analysis – LSA [17]) and six knowledge-based (Leacock and Chodorow – lch [18], Lesk [1], Wu & Palmer – wup [46], Resnik [34], Lin [22], and Jiang & Conrath – jcn [14]). Below we offer a brief description of each of these metrics; a small sketch of the overall sim(T1, T2) computation follows the descriptions.

– Pointwise Mutual Information (PMI-IR) – first suggested by Turney [42] as a measure of the semantic similarity of words, it is based on word co-occurrence, using counts collected over very large corpora. Given two words, their PMI-IR score indicates a degree of dependence between them.

– Latent Semantic Analysis (LSA) – uses a technique from algebra called Singular Value Decomposition (SVD), which in this case decomposes the term-document matrix into three matrices (reducing the linear system into multidimensional vectors of semantic space). A term-document matrix is composed of passages as rows and words as columns, with the cells containing the number of times a certain word is used in a certain passage.

– Lesk – measures the number of overlaps (shared words) in the definitions (glosses) of two concepts and of concepts directly related to them through relations of hypernymy and meronymy.

– Lch – computes the similarity between two nodes by finding the path length between them in the is-a hierarchy. An is-a hierarchy defines specializations (if we travel down the tree) and generalizations between concepts (if we go up).

– Wup – computes the similarity between two nodes by calculating the path length from the least common subsumer (LCS) of the nodes, i.e., the most specific node the two nodes share as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

– Resnik – takes the LCS of two concepts and computes an estimate of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have a low Information Content, while a concept that is quite rare will have a high Information Content.

– Lin – takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.
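As an illustration of equation (6), the sketch below computes the two directional averages with a generic word-to-word similarity and idf values passed in by the caller; the tiny maxSim used in the example (exact match = 1, otherwise 0) is a stand-in for any of the eight corpus- or knowledge-based measures above, and the idf values are hypothetical.

```python
def max_sim(word, segment, word_sim):
    """maxSim(w, T): best word-to-word similarity between `word` and any word of `segment`."""
    return max((word_sim(word, other) for other in segment), default=0.0)

def text_similarity(t1, t2, idf, word_sim=lambda a, b: 1.0 if a == b else 0.0):
    """sim(T1, T2) from equation (6): idf-weighted, averaged in both directions."""
    def directed(source, target):
        weights = [(max_sim(w, target, word_sim), idf.get(w, 1.0)) for w in source]
        total_idf = sum(i for _, i in weights) or 1.0
        return sum(s * i for s, i in weights) / total_idf
    return 0.5 * (directed(t1, t2) + directed(t2, t1))

if __name__ == "__main__":
    # Hypothetical idf values; rare words such as 'mogadishu' weigh more than 'the'.
    idf = {"the": 0.1, "capital": 2.0, "of": 0.1, "somalia": 3.0, "mogadishu": 4.0, "is": 0.1}
    t1 = "mogadishu is the capital of somalia".split()
    t2 = "the capital of somalia is mogadishu".split()
    t3 = "paris is the capital of france".split()
    print(round(text_similarity(t1, t2, idf), 2))  # 1.0 -- same words, reordered
    print(round(text_similarity(t1, t3, idf), 2))  # lower -- only partial overlap
```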

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods that decide whether one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model to detect semantic relations between word pairs from different syntactic categories: comparisons should not be limited to two verbs, two nouns or two adjectives, but should also cover a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet Domains resource.

– Cosine measure – by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it fails to detect words with the same meaning; they therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors, a and b, and the similarity is calculated using the following formula:

    sim(a, b) = (a · W · b) / (|a| |b|)        (7)


with W being a semantic similarity matrix containing information about the similarity level of a word pair – in other words, a number between 0 and 1 for each element Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea's, Qiu's or Zhang's approaches, and five with a better precision rating than those three as well. A small sketch of this computation is given below.
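Here is a minimal sketch of equation (7), assuming NumPy: the vocabulary is the union of the two sentences' words, the binary vectors mark word presence, and W can be filled by any word-by-word similarity; the hand-set synonym entry below is only a stand-in for the WordNet metrics.

```python
import numpy as np

def matrix_similarity(s1_tokens, s2_tokens, word_sim):
    """sim(a, b) = (a . W . b) / (|a| |b|), with W[i][j] = word_sim(vocab[i], vocab[j])."""
    vocab = sorted(set(s1_tokens) | set(s2_tokens))
    a = np.array([1.0 if w in s1_tokens else 0.0 for w in vocab])
    b = np.array([1.0 if w in s2_tokens else 0.0 for w in vocab])
    W = np.array([[word_sim(wi, wj) for wj in vocab] for wi in vocab])
    return float(a @ W @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

if __name__ == "__main__":
    # Hand-set similarity as a placeholder for Lch/Lesk/Wup/Resnik/Lin/Jcn values.
    def word_sim(w1, w2):
        if w1 == w2:
            return 1.0
        if {w1, w2} == {"car", "automobile"}:
            return 0.9
        return 0.0

    s1 = "the car is red".split()
    s2 = "the automobile is red".split()
    print(round(matrix_similarity(s1, s2, word_sim), 2))  # close to 1 thanks to W
```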

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

    Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)        (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence,


expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described above, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is the work we found to have the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques we also saw different types of corpora used for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID (http://www.inesc-id.pt) that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module in particular. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage the idea is to gather important information



about a question (question analysis) and classify it under a category that should represent the type of answer we are looking for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask in this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of that article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, the latter with normalization, clustering and filtering of the candidate answers, in order to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to


a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

    Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)        (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in one another. As mentioned earlier, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any member of one cluster and any member of the other. Initially every candidate answer has a cluster of its own, and then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer in it. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its size.

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved        (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus        (11)

F-measure (F1):

    F = (2 × Precision × Recall) / (Precision + Recall)        (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus        (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

    MRR = (1/N) × Σ_{i=1}^{N} (1 / rank_i)        (14)
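For completeness, here is a small sketch of how these five measures can be computed from a list of per-question judgements; the field names are hypothetical and do not belong to JustAsk's evaluation scripts.

```python
def evaluate(results):
    """`results`: one dict per test question, e.g.
       {"answered": True, "correct": True, "rank_of_first_correct": 2}  (rank is None if none)."""
    total = len(results)
    answered = [r for r in results if r["answered"]]
    correct = sum(1 for r in answered if r["correct"])

    precision = correct / len(answered) if answered else 0.0      # eq. (10)
    recall = len(answered) / total if total else 0.0              # eq. (11)
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0  # eq. (12)
    accuracy = correct / total if total else 0.0                  # eq. (13)
    mrr = sum(1.0 / r["rank_of_first_correct"]                    # eq. (14)
              for r in results if r.get("rank_of_first_correct")) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mrr": mrr}

if __name__ == "__main__":
    sample = [
        {"answered": True,  "correct": True,  "rank_of_first_correct": 1},
        {"answered": True,  "correct": False, "rank_of_first_correct": 3},
        {"answered": False, "correct": False, "rank_of_first_correct": None},
    ]
    print(evaluate(sample))
```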

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input, and then we check whether the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also provides the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions of those types were gathered from the Multieight corpus [25] and used as well.

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo) – passages which we call 'snippets'. This process is quite exhaustive in itself: the corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:

1 Question=What is the capital of Somalia
URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing at most the stipulated maximum of 64 snippets per question. The script runs through the list of questions, one by one, and for each of them presents all the snippets, one by one, stripping 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word of the content even if it has no relation whatsoever to the actual answer. But it helps one find and write down correct answers much faster than the completely manual approach we followed before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,


while Figure 3 presents a snapshot of the application running; a minimal sketch of this annotation loop is given below, after Figure 3. This software also has the advantage of making it possible to construct new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line
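Here is a minimal sketch of the annotation loop just described, under the assumption that the questions/references and the snippets are already loaded into plain Python structures (file parsing omitted); the behaviour – accept any typed string that occurs in the snippet's content as a new reference, and dump the reference list after each question – mirrors the description above, while the names are illustrative.

```python
def annotate_references(questions, snippets_by_question, out_path="references.txt"):
    """questions: {qid: (question_text, [existing references])}
       snippets_by_question: {qid: [snippet content strings]}"""
    references = {qid: list(refs) for qid, (_, refs) in questions.items()}
    for qid, (text, _) in questions.items():
        print("\nQuestion:", text)
        for content in snippets_by_question.get(qid, []):
            print("Content:", content)
            candidate = input("reference (empty to skip snippet): ").strip()
            # any typed string contained in the snippet counts as a valid reference
            if candidate and candidate in content and candidate not in references[qid]:
                references[qid].append(candidate)
        with open(out_path, "w", encoding="utf-8") as out:   # dump status after each question
            for key, refs in references.items():
                out.write(f"{key}\t{'|'.join(refs)}\n")
    return references

if __name__ == "__main__":
    qs = {1: ("What is the capital of Somalia?", ["Mogadishu"])}
    snips = {1: ["Mogadishu is the largest city in Somalia and the nation's capital."]}
    annotate_references(qs, snips)
```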


3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions, from which we filtered those we knew the New York Times had answered correctly; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly and what the 'correct' answer it gives is. A small application was created, taking the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times retrieved a correct answer.
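A sketch of that filtering step is given below, assuming two plain-text inputs whose exact formats are hypothetical here: a list of questions (id, tab, question) and a judgement file listing, per question id, the sources judged to hold a correct answer.

```python
def load_correct_sources(judgements_path):
    """Map question id -> set of sources judged to contain a correct answer."""
    correct = {}
    with open(judgements_path, encoding="utf-8") as f:
        for line in f:
            qid, source, verdict = line.rstrip("\n").split("\t")[:3]
            if verdict == "correct":
                correct.setdefault(qid, set()).add(source)
    return correct

def filter_questions(questions_path, judgements_path, source="NYT"):
    """Keep only the questions for which `source` was judged to answer correctly."""
    correct = load_correct_sources(judgements_path)
    kept = []
    with open(questions_path, encoding="utf-8") as f:
        for line in f:
            qid, question = line.rstrip("\n").split("\t")[:2]
            if source in correct.get(qid, set()):
                kept.append((qid, question))
    return kept

if __name__ == "__main__":
    # hypothetical file names, for illustration only
    for qid, question in filter_questions("trec_questions.txt", "trec_judgements.txt"):
        print(qid, question)
```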

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us wonder how much we can improve something that has no clear references, because the question itself is referencing the previous one, and so forth.

The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that was often not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and by fixing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, in several variations, as they do in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/sentence, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we had hoped for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, for the 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions    Correct    Wrong    No Answer
200          88         87       25

Accuracy    Precision    Recall    MRR     Precision1
44.0%       50.3%        87.5%     0.37    35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to compute the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer was extracted, while the success of the AS stage is given by the number of questions for which one correct final answer was selected from the list of candidate answers. The tables also present the distribution of the questions by the number of extracted answers.

                Google – 8 passages           Google – 32 passages
Extracted       Questions   CAE      AS       Questions   CAE      AS
answers (#)     (#)         success  success  (#)         success  success
0               21          0        –        19          0        –
1 to 10         73          62       53       16          11       11
11 to 20        32          29       22       13          9        7
21 to 40        11          10       8        61          57       39
41 to 60        1           1        1        30          25       17
61 to 100       0           –        –        18          14       11
> 100           0           –        –        5           4        3
All             138         102      84       162         120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the CAE and AS stages are 100% successful when 11 to 20 answers are extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no


                Google – 64 passages
Extracted       Questions    CAE        AS
answers (#)     (#)          success    success
0               17           0          0
1 to 10         18           9          9
11 to 20        2            2          2
21 to 40        16           12         8
41 to 60        42           38         25
61 to 100       38           32         18
> 100           35           30         17
All             168          123        79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google

            Google – 8 passages    Google – 32 passages    Google – 64 passages
     Total   Q    CAE   AS          Q    CAE   AS            Q    CAE   AS
abb    4     3    1     1           3    1     1             4    1     1
des    9     6    4     4           6    4     4             6    4     4
ent   21    15    1     1          16    1     0            17    1     0
hum   64    46   42    34          55   48    36            56   48    31
loc   39    27   24    19          34   27    22            35   29    21
num   63    41   30    25          48   35    25            50   40    22
     200   138  102    84         162  120    88           168  123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category

matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) when compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and


was therefore used permanently while searching for even better results. These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

             1134 questions    232 questions    1210 questions
Precision    26.7%             28.2%            27.4%
Recall       77.2%             67.2%            76.9%
Accuracy     20.6%             19.0%            21.1%

Table 5: Baseline results for the three question sets created using TREC questions and the New York Times corpus

Later, we made further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

             20       30       50       100
Accuracy     21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus

Even though 100 passages offer the best results, they also take more than three times as long to process as 20 passages, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these settings is viable, with 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', the former being the correct one. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals the source of the problem:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'


– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls – Okavango River – Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer (within a certain predefined threshold) to the keywords used in the search.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between the two answers, which should give it a slight edge over Levenshtein. It would, if we had more than a single common link (as we have here: 'Kennedy') out of three words in each answer; as it is, LogSim gives us approximately 0.014, while Levenshtein actually gives a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens differ, should we reject the pair? The answer is likely to be 'yes'.
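A rough sketch of the token-level check proposed above follows; the helper names and acceptance conditions are illustrative assumptions. Two person-name answers are matched when they share at least one full token and the remaining tokens agree at least on their initials.

```python
def tokens(answer):
    return [t.strip(".") for t in answer.split() if t.strip(".")]

def same_person(a1, a2):
    """'J. F. Kennedy' vs 'John Fitzgerald Kennedy': one shared token + matching initials."""
    t1, t2 = tokens(a1), tokens(a2)
    shared = set(w.lower() for w in t1) & set(w.lower() for w in t2)
    if not shared:
        return False
    if len(t1) != len(t2):
        # still accept e.g. 'Kennedy' vs 'John Kennedy' when one token list is contained in the other
        return all(w.lower() in (x.lower() for x in t2) for w in t1) or \
               all(w.lower() in (x.lower() for x in t1) for w in t2)
    # same number of tokens: every remaining pair must at least share its initial, in order
    return all(w1.lower() == w2.lower() or w1[0].lower() == w2[0].lower()
               for w1, w2 in zip(t1, t2))

if __name__ == "__main__":
    print(same_person("J. F. Kennedy", "John Fitzgerald Kennedy"))  # True
    print(same_person("John F. Kennedy", "John Kerry"))             # False
```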


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the line for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be to normalize all the answers into a standard form. As for the misspelling issue, a similarity measure such


as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to cover these acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), by implementing a set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers

We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.

The goal here is to extend the metrics already in place to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to


analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After this process is automated and the answering module extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the purpose of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to – and a score for each relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of information we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes the other, and one for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for any of the corpora. A minimal sketch of this scoring scheme is given below.
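The class and field names in the sketch that follows are illustrative, not JustAsk's actual Scorer interface: each relation found for a candidate simply adds a fixed weight to its score, with frequency and equivalence worth 1.0 and inclusion worth 0.5, as in the first experiment described above.

```python
RELATION_WEIGHTS = {"frequency": 1.0, "equivalence": 1.0,
                    "includes": 0.5, "included_by": 0.5}

class Candidate:
    def __init__(self, text):
        self.text = text
        self.relations = []       # list of (relation_type, other_answer_text)
        self.score = 1.0          # base score of an extracted candidate

def apply_relation_scores(candidates, weights=RELATION_WEIGHTS):
    """Boost each candidate's score by the weight of every relation it participates in."""
    for candidate in candidates:
        candidate.score += sum(weights.get(rel, 0.0) for rel, _ in candidate.relations)
    return sorted(candidates, key=lambda c: c.score, reverse=True)

if __name__ == "__main__":
    a = Candidate("April 3, 2009")
    b = Candidate("2009")
    a.relations.append(("includes", b.text))       # 'April 3, 2009' includes '2009'
    b.relations.append(("included_by", a.text))
    b.relations.append(("frequency", b.text))      # '2009' also appeared a second time
    for c in apply_relation_scores([a, b]):
        print(c.text, c.score)
```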


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: the Baseline Scorer, to count the frequency, and the Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
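Here is a sketch of such a conversion normalizer. The canonical units (miles, feet, pounds, Fahrenheit) and the category names follow the text; the rounded conversion factors, regular expressions and function names are assumptions of this illustration, not the actual normalizer class.

```python
import re

# factor-based conversions to a canonical unit per category
CONVERSIONS = {
    "Numeric:Distance": (r"([\d.]+)\s*(km|kilometres?|kilometers?)\b", 0.621371, "miles"),
    "Numeric:Size":     (r"([\d.]+)\s*(m|meters?|metres?)\b",          3.28084,  "feet"),
    "Numeric:Weight":   (r"([\d.]+)\s*(kg|kilograms?)\b",              2.20462,  "pounds"),
}

def normalize(answer, category):
    """Convert metric mentions in `answer` to the canonical unit of its category."""
    if category == "Numeric:Temperature":
        return re.sub(r"([\d.]+)\s*(?:°\s*C|degrees?\s+Celsius)\b",
                      lambda m: f"{float(m.group(1)) * 9 / 5 + 32:g} Fahrenheit",
                      answer, flags=re.I)
    if category in CONVERSIONS:
        pattern, factor, unit = CONVERSIONS[category]
        return re.sub(pattern,
                      lambda m: f"{float(m.group(1)) * factor:g} {unit}",
                      answer, flags=re.I)
    return answer

if __name__ == "__main__":
    print(normalize("about 300 km long", "Numeric:Distance"))      # about 186.411 miles long
    print(normalize("100 degrees Celsius", "Numeric:Temperature")) # 212 Fahrenheit
```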

Another normalizer was created to remove several common abbreviaturesfrom the candidate answers hoping that (with the help of new measures) wecould improve clustering too The abbreviatures removed were lsquoMrrsquo lsquoMrsrsquorsquo Mslsquo lsquoKingrsquo lsquoQueenrsquo lsquoDamersquo lsquoSirrsquo and lsquoDrrsquo Over here we find a partic-ular issue worthy of noting we could be removing something we should notbe removing Consider the possible answers lsquoMr Trsquo or lsquoQueenrsquo (the band)Curiously an extended corpus of 1134 questions provided us even other exam-ple we would probably be missing lsquoMr Kingrsquo ndash in this case however we caneasily limit King removals only to those occurences of the word lsquoKingrsquo right atthe beggining of the answer We can also add exceptions for the cases aboveStill it remains a curious statement that we must not underestimate under anycircumstances the richness of a natural language

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'U.S.' to 'United States' or 'U.N.' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, leaving room for further extensions.
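A table-driven sketch of such an acronym normalizer could look as follows, assuming a simple string-to-string map covering only the cases mentioned above; it is not the actual implementation.

import java.util.*;

// Sketch of a table-driven acronym normalizer for Location:Country and Human:Group answers.
class AcronymNormalizer {
    private static final Map<String, String> ACRONYMS = new HashMap<>();
    static {
        ACRONYMS.put("U.S.", "United States");
        ACRONYMS.put("US", "United States");
        ACRONYMS.put("U.N.", "United Nations");
        ACRONYMS.put("UN", "United Nations");
    }

    static String normalize(String answer) {
        return ACRONYMS.getOrDefault(answer, answer);
    }
}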

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (which we defined as 20 characters), we boost that answer's score (by 5 points).
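Sketching this idea in code, and assuming the passage and keywords are available as plain strings (a simplification of the actual passage representation), the boost could be computed as follows.

// Sketch of the proximity boost: if the average character distance between a candidate
// answer and the query keywords found in the passage is below a threshold (20 characters),
// the answer's score is raised by 5 points. The passage handling is simplified.
class ProximityScorer {
    static final int THRESHOLD = 20;
    static final double BOOST = 5.0;

    static double boost(String passage, String answer, String[] keywords, double score) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0) return score;
        int sum = 0, found = 0;
        for (String kw : keywords) {
            int kwPos = passage.indexOf(kw);
            if (kwPos >= 0) {
                sum += Math.abs(kwPos - answerPos);
                found++;
            }
        }
        if (found > 0 && (sum / (double) found) < THRESHOLD) {
            return score + BOOST;
        }
        return score;
    }
}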

4.3 Results

                     200 questions                 1210 questions
Before features      P: 50.6  R: 87.0  A: 44.0     P: 28.0  R: 79.5  A: 22.2
After features       P: 51.1  R: 87.0  A: 44.5     P: 28.2  R: 79.8  A: 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results, before and after the implementation of the features described in 4.1 and 4.2, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that were put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references we had gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgement. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADITORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
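The core of the annotation loop can be sketched as follows; file handling is omitted and the prompt text is illustrative, but the accepted options mirror the ones listed above.

import java.util.Scanner;

// Minimal sketch of the command-line annotation loop: for each pair of references of a
// question the user types E, I1, I2, A, C, D or 'skip'; anything else repeats the prompt.
class RelationAnnotator {
    static String askJudgement(Scanner in, String questionId, String ref1, String ref2) {
        while (true) {
            System.out.printf("QUESTION %s: '%s' vs '%s' [E/I1/I2/A/C/D/skip]%n",
                              questionId, ref1, ref2);
            String cmd = in.nextLine().trim();
            switch (cmd) {
                case "E":  return "EQUIVALENCE";
                case "I1": case "I2": return "INCLUSION";
                case "A":  return "COMPLEMENTARY ANSWERS";
                case "C":  return "CONTRADITORY ANSWERS";
                case "D":  return "IN DOUBT";
                case "skip": return null;
                default:   System.out.println("Unknown option, please repeat the judgement.");
            }
        }
    }
}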

Figure 6: Execution of the application to create relations corpora in the command line.

Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.

5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A,B) = \frac{|A \cap B|}{|A \cup B|}   (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements occurring in either of them.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A,B) = \frac{2|A \cap B|}{|A| + |B|}   (16)

Dice's Coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. These sets are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five elements: Ke, en, ne, ed and dy. The result is therefore (2 · 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven elements.
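The bigram-based computation of both coefficients can be sketched as follows; it reproduces the 'Kennedy' vs. 'Keneddy' example above.

import java.util.*;

// Sketch of bigram-based Jaccard and Dice coefficients.
class BigramSimilarity {
    static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    static double jaccard(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return inter.size() / (double) union.size();
    }

    static double dice(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        return 2.0 * inter.size() / (x.size() + y.size());
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));    // 2*5/(6+6) = 0.833...
        System.out.println(jaccard("Kennedy", "Keneddy")); // 5/7 = 0.714...
    }
}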

– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)   (17)

in which s_1 and s_2 are the two sequences to be tested, |s_1| and |s_2| are their lengths, m is the number of matching characters in s_1 and s_2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

d_w = d_j + l \, p \, (1 - d_j)   (18)

with d_j being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance does. In other words, it will give a much better score to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta', than Jaro.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 used by Levenshtein):

D(i,j) = \min \begin{cases} D(i-1,j-1) + G \\ D(i-1,j) + G \\ D(i,j-1) + G \end{cases}   (19)
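A sketch of this dynamic programme is given below; treating matching characters as having cost 0 is an assumption, since the text above only specifies the variable gap cost G.

// Sketch of the Needleman-Wunsch style dynamic programme with a configurable cost G.
class NeedlemanWunsch {
    static double distance(String s1, String s2, double g) {
        double[][] d = new double[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) d[i][0] = i * g;
        for (int j = 0; j <= s2.length(); j++) d[0][j] = j * g;
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                // assumption: substitution costs 0 on a match and G on a mismatch
                double sub = (s1.charAt(i - 1) == s2.charAt(j - 1)) ? 0.0 : g;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                          Math.min(d[i - 1][j] + g, d[i][j - 1] + g));
            }
        }
        return d[s1.length()][s2.length()];
    }
}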

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.

Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much the threshold affects the performance of each metric.
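For reference, the single-link clustering procedure used in these experiments can be sketched as follows; the metric is passed in as a function and the data structures are simplified with respect to the actual clusterer.

import java.util.*;
import java.util.function.BiFunction;

// Sketch of single-link clustering: every answer starts in its own cluster and the two
// closest clusters (minimum pairwise distance) are merged while that distance stays
// under the threshold.
class SingleLinkClusterer {
    static List<List<String>> cluster(List<String> answers,
                                      BiFunction<String, String, Double> metric,
                                      double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String a : answers) clusters.add(new ArrayList<>(Collections.singletonList(a)));
        while (clusters.size() > 1) {
            int bestI = -1, bestJ = -1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    for (String x : clusters.get(i)) {
                        for (String y : clusters.get(j)) {
                            double dist = metric.apply(x, y);
                            if (dist < best) { best = dist; bestI = i; bestJ = j; }
                        }
                    }
                }
            }
            if (best > threshold) break;  // halt when the remaining clusters are too far apart
            clusters.get(bestI).addAll(clusters.remove(bestJ));
        }
        return clusters;
    }
}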

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.

Metric / Threshold     0.17                      0.33                      0.5
Levenshtein            P: 51.1 R: 87.0 A: 44.5   P: 43.7 R: 87.5 A: 38.5   P: 37.3 R: 84.5 A: 31.5
Overlap                P: 37.9 R: 87.0 A: 33.0   P: 33.0 R: 87.0 A: 33.0   P: 36.4 R: 86.5 A: 31.5
Jaccard index          P: 9.8  R: 56.0 A: 5.5    P: 9.9  R: 55.5 A: 5.5    P: 9.9  R: 55.5 A: 5.5
Dice Coefficient       P: 9.8  R: 56.0 A: 5.5    P: 9.9  R: 56.0 A: 5.5    P: 9.9  R: 55.5 A: 5.5
Jaro                   P: 11.0 R: 63.5 A: 7.0    P: 11.9 R: 63.0 A: 7.5    P: 9.8  R: 56.0 A: 5.5
Jaro-Winkler           P: 11.0 R: 63.5 A: 7.0    P: 11.9 R: 63.0 A: 7.5    P: 9.8  R: 56.0 A: 5.5
Needleman-Wunch        P: 43.7 R: 87.0 A: 38.0   P: 43.7 R: 87.0 A: 38.0   P: 16.9 R: 59.0 A: 10.0
LogSim                 P: 43.7 R: 87.0 A: 38.0   P: 43.7 R: 87.0 A: 38.0   P: 42.0 R: 87.0 A: 36.5
Soundex                P: 35.9 R: 83.5 A: 30.0   P: 35.9 R: 83.5 A: 30.0   P: 28.7 R: 82.0 A: 23.5
Smith-Waterman         P: 17.3 R: 63.5 A: 11.0   P: 13.4 R: 59.5 A: 8.0    P: 10.4 R: 57.5 A: 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are obtained with Levenshtein at threshold 0.17.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in

order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least a perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more answers of this type. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by

commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))}   (21)

and

firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2}   (22)

with a_1 and a_2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
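A minimal sketch of this measure following equations (20)-(22) is given below; the whitespace tokenization and the case-insensitive first-letter comparison are assumptions, since the text only specifies the scores themselves.

import java.util.*;

// Sketch of the new distance: exact token matches count fully, and an unmatched token
// whose first letter matches an unmatched token of the other answer counts half.
class JFKDistance {
    static double distance(String a1, String a2) {
        List<String> t1 = new ArrayList<>(Arrays.asList(a1.split("\\s+")));
        List<String> t2 = new ArrayList<>(Arrays.asList(a2.split("\\s+")));
        int maxLength = Math.max(t1.size(), t2.size());

        // common terms: perfect token matches, removed from both token lists
        int commonTerms = 0;
        for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
            if (t2.remove(it.next())) { it.remove(); commonTerms++; }
        }

        // first-letter matches among the terms that remain unmatched
        int nonCommonTerms = t1.size() + t2.size();
        int firstLetterMatches = 0;
        for (String x : t1) {
            for (Iterator<String> jt = t2.iterator(); jt.hasNext(); ) {
                if (Character.toLowerCase(x.charAt(0)) ==
                        Character.toLowerCase(jt.next().charAt(0))) {
                    jt.remove(); firstLetterMatches++; break;
                }
            }
        }

        double commonTermsScore = commonTerms / (double) maxLength;
        double firstLetterMatchesScore = nonCommonTerms == 0
                ? 0.0 : (firstLetterMatches / (double) nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    public static void main(String[] args) {
        // one common term out of three tokens plus two first-letter matches over four
        // unmatched terms: 1 - 1/3 - (2/4)/2 = 0.416..., lower than Overlap or Levenshtein give
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}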

A few flaws of this measure were identified later on, namely the scope of this measure within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.

5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                     200 questions                 1210 questions
Before               P: 51.1  R: 87.0  A: 44.5     P: 28.2  R: 79.8  A: 22.5
After new metric     P: 51.7  R: 87.0  A: 45.0     P: 29.2  R: 79.8  A: 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a combination of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)
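Such a decision tree could be realized as a simple selector like the sketch below; the category names and the metric interface are illustrative stand-ins for the ones used in JustAsk.

// Sketch of per-category metric selection.
class MetricSelector {
    interface StringMetric { double distance(String a, String b); }

    static StringMetric forCategory(String category, StringMetric levenshtein,
                                    StringMetric logSim, StringMetric fallback) {
        switch (category) {
            case "Human":    return levenshtein;
            case "Location": return logSim;
            default:         return fallback;
        }
    }
}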

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there were a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for too much), the results for every metric incorporated, for the three main categories that can be easily evaluated – Human, Location and Numeric – are shown in Table 10 (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).

Metric / Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index        7.58     0.00       0.00
Dice                 7.58     0.00       0.00
Jaro                 9.09     0.00       0.00
Jaro-Winkler         9.09     0.00       0.00
Needleman-Wunch      42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman       9.09     2.56       0.00
JFKDistance          54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. The best accuracy of the system per category is highlighted, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how they perform overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than the current prime-minister, Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.

– Answers in formats different from the expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And, in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is in the reference file that presents all the possible 'correct' answers, which may be missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.

6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art survey on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.

References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computational Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

Appendix

Appendix A – Rules for detecting relations between answers

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is a hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: 'M01 Y2001' includes 'D01 M01 Y2001'

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: 'Y2001' includes 'D01 M01 Y2001'


1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: '200 people' is equivalent to '200'

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
  – Example: '210 people' is equivalent to '200 people' (this max/min rule is sketched in code at the end of this section)

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
  – Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  – Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  – Example: 'more than 200' includes '300'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  – Example: 'less than 300' includes '200'
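As referenced above, the max/min equivalence rule for Numeric answers can be sketched as follows; the number-extraction regular expression is simplified with respect to the 'number' pattern defined at the start of this section, and the class name is illustrative.

import java.util.regex.*;

// Sketch of the approximate-equivalence rule: two numeric answers are treated as
// equivalent when their values are within the max/min factors (1.1 and 0.9 above).
class NumericEquivalence {
    private static final Pattern NUMBER = Pattern.compile("(\\d+(\\.[0-9E]+)?)");
    private static final double MAX = 1.1, MIN = 0.9;

    static boolean equivalent(String a1, String a2) {
        Matcher m1 = NUMBER.matcher(a1), m2 = NUMBER.matcher(a2);
        if (!m1.find() || !m2.find()) return false;
        double n1 = Double.parseDouble(m1.group(1));
        double n2 = Double.parseDouble(m2.group(1));
        return n1 * MAX > n2 && n1 * MIN < n2;  // e.g. '210 people' vs '200 people'
    }
}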

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: 'Mr. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: 'John Fitzgerald Kennedy' is equivalent to 'John F. Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F. Kennedy' is equivalent to 'F. Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: 'Mr. Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)

• levenshteinDistance(Ai, Aj) < 1

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: 'Republic of China' is equivalent to 'China'

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: 'Florida' includes 'Miami, Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: 'California' includes 'Los Angeles, California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  – Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: 'animal' includes 'monkey'

• Introduction
• Related Work
  • Relating Answers
  • Paraphrase Identification
    • SimFinder – A methodology based on linguistic analysis of comparable corpora
    • Methodologies purely based on lexical similarity
    • Canonicalization as complement to lexical features
    • Use of context to extract similar phrases
    • Use of dissimilarities and semantics to provide better similarity measures
  • Overview
• JustAsk architecture and work points
  • JustAsk
    • Question Interpretation
    • Passage Retrieval
    • Answer Extraction
  • Experimental Setup
    • Evaluation measures
    • Multisix corpus
    • TREC corpus
  • Baseline
    • Results with Multisix Corpus
    • Results with TREC Corpus
  • Error Analysis over the Answer Module
• Relating Answers
  • Improving the Answer Extraction module: Relations Module Implementation
  • Additional Features
    • Normalizers
    • Proximity Measure
  • Results
  • Results Analysis
  • Building an Evaluation Corpus
• Clustering Answers
  • Testing Different Similarity Metrics
    • Clusterer Metrics
    • Results
    • Results Analysis
  • New Distance Metric – JFKDistance
    • Description
    • Results
    • Results Analysis
  • Possible new features – Combination of distance metrics to enhance overall results
  • Posterior analysis
• Conclusions and Future Work
Page 20: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

by the total number of documents in corpus divided by the total number of doc-uments including the word For the corpus they used British National Corpus4 a collection of 100 million words that gathers both spoken-language andwritten samples

The similarity of two text segments T1 and T2 (sim(T1T2)) is given by thefollowing formula

1

2(

sumwisinT1(maxsim(w T2)idf(w))sum

wisinT1 idf(w)+

sumwisinT2(maxsim(w T1)idf(w))sum

wisinT2 idf(w))

(6)For each word w in the segment T1 the goal is to identify the word that

matches the most in T2 given by maxsim (w T2) and then apply the specificityvia IDF make a sum of these two methods and normalize it with the length ofthe text segment in question The same process is applied to T2 and in the endwe generate an average between T1 to T2 and vice versa They offer us eightdifferent similarity measures to calculate maxsim two corpus-based (PointwiseMutual Information [42] and Latent Semantic Analysis ndash LSA [17]) and sixknowledge-based (Leacock and Chodorowndashlch [18] Lesk [1] Wu amp Palmerndashwup[46] Resnik [34] Lin [22] and Jiang amp Conrathndashjcn [14]) Below we offer abrief description of each of these metrics

ndash Pointwise Mutual Information (PMI-IR) ndash first suggested by Turney [42]as a measure of semantic similarity of words is based on the word co-occurrence using counts collected over very large corpora Given twowords their PMI-IR score indicates a degree of dependence between them

ndash Latent Semantic Analysis (LSA) ndash uses a technique in algebra called Sin-gular Value Decomposition (SVD) which in this case decomposes theterm-document matrix into three matrixes (reducing the linear systeminto multidimensional vectors of semantic space) A term-document ma-trix is composed of passages as rows and words as columns with the cellscontaining the number of times a certain word was used in a certain pas-sage

ndash Lesk ndash measures the number of overlaps (shared words) in the definitions(glosses) of two concepts and concepts directly related through relationsof hypernymy and meronymy

ndash Lch ndash computes the similarity between two nodes by finding the pathlength between two nodes in the is minus a hierarchy An is minus a hierarchydefines specializations (if we travel down the tree) and generalizationsbetween concepts (if we go up)

ndash Wup ndash computes the similarity between two nodes by calculating the pathlength from the least common subsumer (LCS) of the nodes ie the most

4httpwwwnatcorpoxacuk

20

specific node two nodes share in common as an ancestor If one node islsquocatrsquo and the other is lsquodogrsquo the LCS would be lsquomammalrsquo

ndash Resnik ndash takes the LCS of two concepts and computes an estimation of howlsquoinformativersquo the concept really is (its Information Content) A conceptthat occurs frequently will have low Information Content while a conceptthat is quite rare will have a high Information Content

ndash Lin ndash takes the Resnik measure and normalizes it by using also the Infor-mation Content of the two nodes received as input

ndash Jcn ndash uses the same information as Lin but combined in a different way

Kozareva Vazquez and Montoyo [16] also recognize the importance of havingsemantic knowledge in Textual Entailment Recognition (RTE) methods iemethods to decide if one text can be entailed in another and vice versa ndash inother words paraphrase identification Moreover they present a new model inorder to detect semantic relations between word pairs from different syntacticcategories Comparisons should not be only limited between two verbs twonouns or two adjectives but also between a verb and a noun (lsquoassassinatedrsquovs lsquoassassinationrsquo) or an adjective and a noun (lsquohappyrsquo vs rsquohappinessrsquo) forinstance They measure similarity using variations of features such as

ndash Sentence Expansion ndash expanding nouns verbs adjectives and adverbsto their synonyms antonyms and verbs that precedeentail the verbs inquestion Word sense disambiguation is used to reduce the potential noiseof such expansion

ndash Latent Semantic Analysis (LSA) ndash by constructing a term-domain matrixinstead of a term-document one in which the domain is extracted fromthe WordNet domain resource

ndash Cosine measure ndash by using Relevant Domains instead of word frequency

Fernando and Stevenson [8] also considered that a purely lexical similarityapproach such as the one conducted by Zhang and Patrick was quite limitedsince it failed to detect words with the same meaning and therefore proposeda semantic similarity approach based on similarity information derived fromWordNet and using the concept of matrix similarity first developed by Steven-son and Greenwood in 2005 [41]

Each input sentence is mapped into a binary vector with the elements equalto 1 if a word is present and 0 otherwise For a pair of sentences S1 and S2 wehave therefore two vectors ~a and ~b and the similarity is calculated using thefollowing formula

sim(~a~b) =~aW~b

~a ~b (7)

21

with W being a semantic similarity matrix containing the information aboutthe similarity level of a word pair ndash in other words a number between 0 and1 for each Wij representing the similarity level of two words one from eachsentence Six different WordNet metrics were used (the already mentioned LchLesk Wup Resnik Lin and Jcn) five of them with an accuracy score higherthan either Mihalcea Qiu or Zhang and five with also a better precision ratingthan those three metrics

More recently Lintean and Rus [23] identify paraphrases of sentences usingweighted dependences and word semantics taking into account both the simi-larity and the dissimilarity between two sentences (S1 and S2) and then usingthem to compute a final paraphrase score as the ratio of the similarity anddissimilarity scores

Paraphrase(S1S2) = Sim(S1 S2)Diss(S1 S2) (8)

If this score is above a certain threshold T the sentences are deemed para-phrases otherwise they are not considered paraphrases The threshold isobtained by optimizing the performance of the approach on training dataThey used word-to-word similarity metrics based on statistical information fromWordNet which can potentially identify semantically related words even if theyare not exactly synonyms (they give the example of the words lsquosayrsquo and lsquoinsistrsquo)These similarity metrics are obtained through weighted dependency (syntactic)relationships [10] built between two words in a sentence a head and a modifierThe weighting is done considering two factors the typelabel of the dependencyand the depth of a dependency in the dependency tree Minipar [21] and theStanford parser [4] were used to extract dependency information To computethe scores they first map the input sentences into sets of dependencies detectthe common and non-common dependencies between the two and then theycalculate the Sim(S1S2) and Diss(S1S2) parameters If two sentences refer tothe same action but one is substantially longer than the other (ie it adds muchmore information) then there is no two-way entailment and therefore the twosentences are non-paraphrases Lintean and Rus decided to add a constant value(the value 14) when the set of unpaired dependencies of the longer sentence hasmore than 6 extra dependencies than the set of unpaired dependencies of theshorter one The duo managed to improve Qiursquos approach by using word seman-tics to compute similarities and by using dependency types and the position inthe dependency tree to weight dependencies instead of simply using unlabeledand unweighted relations

23 Overview

Thanks to the work initiated by Webber et al we can now define four differenttypes of relations between candidate answers equivalence inclusion contradic-tion and alternative Our focus will be on detecting relations of equivalence

22

expanding the two metrics already available in JustAsk (Levenshtein distanceand overlap distance)

We saw many different types of techniques over the last pages and manymore have been developed over the last ten years presenting variations of theideas which have been described such as building dependency trees or usingsemantical metrics dealing with the notions such as synonymy antonymy andhyponymy For the purpose of synthesis the work here represents what wefound with the most potential of being successfully tested on our system

There are two basic lines of work here the use of lexical-based featuressuch as the already implemented Levenshtein distance and the use of semanticinformation (from WordNet) From these there were several interesting featuresattached such as the transformation of units into canonical forms the potentialuse of context surrounding the answers we extract as candidates dependencytrees similarity matrixes and the introduction of a dissimilarity score in theformula of paraphrase identification

For all of these techniques we also saw different types of corpus for eval-uation The most prominent and most widely available is MSR ParaphraseCorpus which was criticized by Zhang and Patrick for being composed of toomany lexically similar sentences Other corpus used was Knight and Marcucorpus (KMC) composed of 1087 sentence pairs Since these two corpus lacknegative pairs ie pairs of non-paraphrases Cordeiro et al decided to createthree new corpus using a mixture of these two with negative pairs in order tobalance the number of positive pairs and negative ones Finally British NationalCorpus (BNC) was used for the semantic approach of Mihalcea et al

23

3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39] a QA system developedat INESC-ID5 that combines rule-based components with machine learning-based ones Over this chapter we will describe the architecture of this systemfocusing on the answers module and show the results already attained by thesystem in general and by the Answer Extraction module We then detail whatwe plan to do to improve this module

31 JustAsk

This section describes what has been done so far with JustAsk As saidbefore the systemrsquos architecture follows the standard QA architecture withthree stages Question Interpretation Passage Retrieval and Answer Extrac-tion The picture below shows the basic architecture of JustAsk We will workover the Answer Extraction module dealing with relationships between answersand paraphrase identification

Figure 1 JustAsk architecture mirroring the architecture of a general QAsystem

311 Question Interpretation

This module receives as input a natural language question and outputs aninterpreted question In this stage the idea is to gather important information

5httpwwwinesc-idpt

24

about a question (question analysis) and classify it under a category that shouldrepresent the type of answer we look for (question classification)

Question analysis includes tasks such as transforming the question into a setof tokens part-of-speech tagging and identification of a questions headwordie a word in the given question that represents the information we are lookingfor Question Classification is also a subtask in this module In JustAsk Ques-tion Classification uses Li and Rothrsquos [20] taxonomy with 6 main categories(Abbreviation Description Entity Human Location and Numeric)plus 50 more detailed classes The framework allows the usage of different clas-sification techniques hand-built rules and machine-learning JustAsk alreadyattains state of the art results in question classification [39]

312 Passage Retrieval

The passage retrieval module receives as input the interpreted question fromthe preceding module The output is a set of relevant passages ie extractsfrom a search corpus that contain key terms of the question and are thereforemost likely to contain a valid answer The module is composed by a group ofquery formulators and searchers The job here is to associate query formulatorsand question categories to search types Factoid-type questions such as lsquoWhenwas Mozart bornrsquo use Google or Yahoo search engines while definition questionslike lsquoWhat is a parrotrsquo use Wikipedia

As for query formulation JustAsk employs multiple strategies accordingto the question category The most simple is keyword-based strategy appliedfor factoid-type questions which consists in generating a query containing allthe words of a given question Another strategy is based on lexico-syntaticpatterns to learn how to rewrite a question in order to get us closer to theactual answer Wikipedia and DBpedia are used in a combined strategy usingWikipediarsquos search facilities to locate the title of the article where the answermight be found and DBpedia to retrieve the abstract (the first paragraph) ofthe article which is returned as the answer

313 Answer Extraction

The answer extraction module receives an interpreted question and a set ofrelevant passages as input and outputs the final answer There are two mainstages here Candidate Answer Extraction and Answer Selection The firstdeals with extracting the candidate answers and the latter with normalizationclustering and filtering the candidate answers to give us a correct final answer

JustAsk uses regular expressions (for Numeric type questions) named en-tities (able to recognize four entity types Person Location Organiza-tion and Miscellaneous) WordNet based recognizers (hyponymy relations)and Gazeteers (for the categories LocationCountry and LocationCity)to extract plausible answers from relevant passages Currently only answers oftype NumericCount and NumericDate are normalized Normalization isimportant to diminish the answers variation converting equivalent answers to

25

a canonical representation For example the dates lsquoMarch 21 1961rsquo and lsquo21March 1961rsquo will be transformed into lsquoD21 M3 Y1961rsquo

Similar instances are then grouped into a set of clusters and for that pur-pose distance metrics are used namely overlap distance and Levenshtein [19]distance The overlap distance is defined as

Overlap(XY ) = 1minus |X⋂Y |(min(|X| |Y |) (9)

where |X⋂Y | is the number of tokens shared by candidate answers X and Y and

min(|X| |Y |) the size of the smallest candidate answer being compared Thisformula returns 00 for candidate answers that are either equal or containedin the other As it was mentioned in Section 313 the Levenshtein distancecomputes the minimum number of operations in order to transform one stringto another These two distance metrics are used with a standard single-linkclustering algorithm in which the distance between two clusters is considered tobe the minimum of the distances between any member of the clusters Initiallyevery candidate answer has a cluster of its own and then for each step the twoclosest clusters are merged if they are under a certain threshold distance Thealgorithm halts when the distances of the remaining clusters are higher thanthe threshold Each cluster has a representative ndash the clusters longest answerand a score is assigned to the cluster which is simply the sum of the scoresof each candidate answer With the exception of candidate answers extractedusing regular expressions the score of a candidate answer is always 10 meaningthe clusters score will in many cases be directly influenced by its length

32 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system

321 Evaluation measures

To evaluate the results of JustAsk the following measures suited to evaluateQA systems [43] are used

Precision

Precision =Correctly judged items

Relevant (nonminus null) items retrieved(10)

Recall

Recall =Relevant (nonminus null) items retrieved

Items in test corpus(11)

F-measure (F1)

F = 2 lowast precision lowast recallprecision+ recall

(12)

Accuracy

Accuracy =Correctly judged items

Items in test corpus(13)

26

Mean Reciprocal Rank (MRR) used to evaluate systems that pro-duce a list of possible responses to a query ordered by probability ofcorrectness The reciprocal rank of a query response is the multiplicativeinverse of the rank of the first correct answer The mean reciprocal rankis the average of the reciprocal ranks of results for a sample of queries Qin a test corpus For a test collection of N queries the mean reciprocalrank is given by

MRR =1

N

Nsumi=1

1

ranki (14)

322 Multisix corpus

To know for sure that a certain system works we have to evaluate it In thecase of QA systems and with JustAsk in particular we evaluate it giving itquestions as input and then we see if the answers the system outputs matchthe ones we found for ourselves

We began to use the Multisix corpus [24] It is a collection of 200 Englishquestions covering many distinct domains composed for the cross-language tasksat CLEF QA-2003 The corpus has also the correct answer for each questionsearched in the Los Angeles Times over 1994 We call references to the setof correct answers for each of the 200 questions A manual validation of thecorpus was made adding to the references many variations of the answers foundover the web sources (such as adding lsquoImola Italyrsquo lsquoImolarsquo and lsquoItalyrsquo to thequestion lsquoWhere did Ayrton Senna have the accident that caused his deathrsquo)In the context of this thesis besides this manual validation an application wasdeveloped to create the test corpus of the references in a more efficient manner

The 200 questions were also manually classified according to Li and Rothrsquosquestion type taxonomy A few answers were updated since the question variedin time (one example is the question lsquoWho is the coach of Indianarsquo) or eveneliminated (10 of the 200 to be more accurate) since they no longer made sense(like asking who is the vice-president of a company that no longer exists or howold is a person that has already died) And since the Multisix corpus lacked onDescriptionDescription and HumanDefinition questions 12 questionswere gathered and used from the Multieight corpus [25]

For each of the 200 questions we have a set of references ndash possible correctanswers ndash initially being updated manually from the first 64 passages that wefound over at each of the two main search engines (Google and Yahoo) ndash whichwe call lsquosnippetsrsquo This process is quite exhaustive itself The corpus file with allthe results that the GoogleYahoo query produces is huge (a possible maximumof 64200 entries) and to review it takes a considerable amount of time andpatience Moreover we do not need all the information in it to add a referenceTaking for instance this entry

1 Question=What is the capital of Somalia

URL= httpenwikipediaorgwikiMogadishu Title=Mogadishu- Wikipedia the free encyclopedia Content=Mogadishu is the largest

27

city in Somalia and the nationrsquos capital Located in the coastalBenadir region on the Indian Ocean the city has served as an Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing at most 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, stripping 'unnecessary' content like 'URL' or 'Title' and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest city in Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and we could therefore write any word from the content with no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,


while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially being used to construct new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.
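As a rough illustration of this reference-collection loop (the class and variable names below are ours, not those of the actual application), the following sketch shows the core idea: each snippet's 'Content' is displayed and whatever the annotator types is only accepted as a reference if the snippet actually contains it.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class ReferenceCollector {

    public static void main(String[] args) {
        Map<String, List<String>> references = new LinkedHashMap<>();
        Scanner in = new Scanner(System.in);

        // Hypothetical data; in the real setting these come from the questions file
        // and from the corpus file with up to 64 snippets per question.
        String question = "What is the capital of Somalia?";
        List<String> snippets = List.of(
                "Mogadishu is the largest city in Somalia and the nation's capital. ...");

        references.put(question, new ArrayList<>());
        for (String content : snippets) {
            System.out.println("QUESTION: " + question);
            System.out.println("CONTENT:  " + content);
            System.out.print("Reference (empty to skip): ");
            String answer = in.nextLine().trim();
            // Only text that occurs in the snippet is accepted as a reference.
            if (!answer.isEmpty() && content.contains(answer)) {
                references.get(question).add(answer);
            }
        }
        System.out.println(references);
    }
}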


3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we extracted the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask: "How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?"

The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
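A minimal sketch of this filtering step is given below; the pronoun list mirrors the one above, while the class name and the idea of receiving the questions as plain strings are our own assumptions.

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class PronounFilter {

    // Third-party pronouns whose referent lives in a previous question of the same topic.
    private static final Pattern PRONOUNS =
            Pattern.compile("\\b(he|she|it|her|him|hers|his)\\b", Pattern.CASE_INSENSITIVE);

    public static List<String> filter(List<String> questions) {
        return questions.stream()
                .filter(q -> !PRONOUNS.matcher(q).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> questions = List.of(
                "When did she die?",                          // dropped: unresolved 'she'
                "Where was the festival held?",               // kept by this rule (but see below)
                "Who was the first Formula 1 world champion?");
        System.out.println(filter(questions));
    }
}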

But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other, more complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus carries topic information that many times is not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after fixing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, and in as many variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not whole documents but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
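The paragraph-level indexing can be sketched as follows, assuming a recent Lucene version (5 or later); the field names and the splitting of articles on blank lines are illustrative choices, not necessarily the ones used in JustAsk.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ParagraphIndexer {

    public static void index(String indexDir, String docId, String articleText) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get(indexDir)),
                new IndexWriterConfig(new StandardAnalyzer()))) {

            // Index each paragraph as its own document, so that retrieval returns
            // a paragraph instead of the whole New York Times article.
            for (String paragraph : articleText.split("\\n\\s*\\n")) {
                Document doc = new Document();
                doc.add(new StringField("docId", docId, Field.Store.YES));
                doc.add(new TextField("contents", paragraph, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }
}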

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision1
44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer is extracted, while the success of the AS stage is given by the number of questions for which one final answer is extracted from the list of candidate answers. The tables also present the distribution of the number of questions over the number of extracted answers.

             Google – 8 passages          Google – 32 passages
Extracted    Questions   CAE      AS      Questions   CAE      AS
Answers                  success  success             success  success
0            21          0        –       19          0        –
1 to 10      73          62       53      16          11       11
11 to 20     32          29       22      13          9        7
21 to 40     11          10       8       61          57       39
41 to 60     1           1        1       30          25       17
61 to 100    0           –        –       18          14       11
> 100        0           –        –       5           4        3
All          138         102      84      162         120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no matter how many passages are used.


             Google – 64 passages
Extracted    Questions   CAE       AS
Answers                  success   success
0            17          0         0
1 to 10      18          9         9
11 to 20     2           2         2
21 to 40     16          12        8
41 to 60     42          38        25
61 to 100    38          32        18
> 100        35          30        17
All          168         123       79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

            Google – 8 passages   Google – 32 passages   Google – 64 passages
     Total    Q    CAE    AS        Q    CAE    AS         Q    CAE    AS
abb      4    3      1     1        3      1     1         4      1     1
des      9    6      4     4        6      4     4         6      4     4
ent     21   15      1     1       16      1     0        17      1     0
hum     64   46     42    34       55     48    36        56     48    31
loc     39   27     24    19       34     27    22        35     29    21
num     63   41     30    25       48     35    25        50     40    22
All    200  138    102    84      162    120    88       168    123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and


was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.

Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages    20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted, '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals why:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

– 'Queen Elizabeth' vs. 'Elizabeth II'

– 'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

– 'River Glomma' vs. 'Glomma'

– 'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'

– 'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that computes similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should be the threshold here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.


3. Places/dates/numbers that differ in specificity – some candidates from the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

– 'Mt Kenya' vs. 'Mount Kenya National Park'

– 'Bavaria' vs. 'Overbayern Upper Bavaria'

– '2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now within JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same referent: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

– 'Kotashaan' vs. 'Khotashan'

– 'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to include these acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

Over this section, we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), through the implementation of the set of rules that composes our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module contains the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships, valuing relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to


analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After this process is automated and the answering module is extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the purpose of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk that detects relations between answers and then uses them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to – and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, giving a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not change at all for any of the corpora.


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
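A minimal sketch of this scoring idea is shown below; the class, enum and method names are our own illustrations, not the actual JustAsk interfaces, and only the weights (1.0 for frequency and equivalence, 0.5 for inclusion) come from the experiment just described.

import java.util.List;
import java.util.Map;

public class RulesScorerSketch {

    enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    // Weights used in the first experiment described above.
    static final Map<RelationType, Double> WEIGHTS = Map.of(
            RelationType.EQUIVALENCE, 1.0,
            RelationType.INCLUDES, 0.5,
            RelationType.INCLUDED_BY, 0.5);

    record Relation(RelationType type, String otherAnswer) {}

    // Frequency counts 1.0 per exact repetition; each detected relation adds its weight.
    static double score(int frequency, List<Relation> relations) {
        double score = frequency * 1.0;
        for (Relation r : relations) {
            score += WEIGHTS.get(r.type());
        }
        return score;
    }

    public static void main(String[] args) {
        double s = score(2, List.of(
                new Relation(RelationType.EQUIVALENCE, "J. F. Kennedy"),
                new Relation(RelationType.INCLUDED_BY, "John Fitzgerald Kennedy")));
        System.out.println(s); // 2.0 + 1.0 + 0.5 = 3.5
    }
}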


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, which is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of the new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not, under any circumstances, underestimate the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, leaving room for further extensions.
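The sketch below illustrates the two flavours of normalizer under discussion, a unit converter and an acronym expander; the conversion factor is standard, but the class and method names, the regular expression and the exact category hookup are assumptions for illustration only.

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnswerNormalizers {

    private static final Pattern KM =
            Pattern.compile("([0-9]+(?:\\.[0-9]+)?)\\s*(?:km|kilometres|kilometers)");

    private static final Map<String, String> ACRONYMS = Map.of(
            "US", "United States",
            "U.S.", "United States",
            "UN", "United Nations");

    // Numeric:Distance answers: convert kilometres to miles so that, e.g.,
    // '483 km' and '300 miles' can be compared on the same scale.
    public static String normalizeDistance(String answer) {
        Matcher m = KM.matcher(answer);
        if (m.find()) {
            double miles = Double.parseDouble(m.group(1)) * 0.621371;
            return String.format("%.0f miles", miles);
        }
        return answer;
    }

    // Location / Human:Group answers: expand a few known acronyms.
    public static String normalizeAcronym(String answer) {
        return ACRONYMS.getOrDefault(answer.trim(), answer);
    }

    public static void main(String[] args) {
        System.out.println(normalizeDistance("483 km")); // "300 miles"
        System.out.println(normalizeAcronym("US"));      // "United States"
    }
}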

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that lie closer to the query keywords appearing in the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (which we defined as being 20 characters long), we boost that answer's score (by 5 points).
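A possible sketch of this heuristic follows, assuming the passage and keywords are available as plain strings; the names and the character-based notion of distance are illustrative, with only the 20-character threshold and the 5-point boost taken from the description above.

import java.util.List;

public class ProximityBoost {

    private static final int DISTANCE_THRESHOLD = 20;
    private static final double BOOST = 5.0;

    // Boost the candidate's score if the average character distance between the answer
    // and the query keywords found in the passage is below the threshold.
    public static double boostedScore(double score, String passage, String answer, List<String> keywords) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0 || keywords.isEmpty()) {
            return score;
        }
        double totalDistance = 0;
        int found = 0;
        for (String keyword : keywords) {
            int keywordPos = passage.indexOf(keyword);
            if (keywordPos >= 0) {
                totalDistance += Math.abs(answerPos - keywordPos);
                found++;
            }
        }
        if (found > 0 && (totalDistance / found) < DISTANCE_THRESHOLD) {
            return score + BOOST;
        }
        return score;
    }
}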


4.3 Results

                    200 questions                 1210 questions
Before features     P 50.6%   R 87.0%   A 44.0%   P 28.0%   R 79.5%   A 22.2%
After features      P 51.1%   R 87.0%   A 44.5%   P 28.2%   R 79.8%   A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered for the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' for a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt about the relation type.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADITORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.

Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

\[ J(A,B) = \frac{|A \cap B|}{|A \cup B|} \tag{15} \]

In other words, the idea is to divide the number of characters the two sequences have in common by the total number of distinct characters across them.

– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

\[ Dice(A,B) = \frac{2|A \cap B|}{|A| + |B|} \tag{16} \]

Dice's coefficient basically gives double the weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams for each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets has five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (five shared bigrams out of seven distinct ones). A small sketch of this bigram computation is given at the end of this subsection.


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

\[ Jaro(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) \tag{17} \]

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

\[ d_w = d_j + l \cdot p \cdot (1 - d_j) \tag{18} \]

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but with a variable gap cost G (instead of the constant 1 used in Levenshtein):

\[ D(i,j) = \min\begin{cases} D(i-1, j-1) + G \\ D(i-1, j) + G \\ D(i, j-1) + G \end{cases} \tag{19} \]

The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequences in their entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.


Then, the LogSim metric by Cordeiro and his team, described in the Related Work section, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much it affects the performance of each metric.
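As a small, self-contained illustration of the bigram-based computation mentioned in the Dice entry above (this is our own sketch, not code from the SimMetrics library), consider:

import java.util.HashSet;
import java.util.Set;

public class DiceBigrams {

    // Build the set of character bigrams of a string.
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Dice's coefficient over bigram sets, as in Equation (16).
    static double dice(String a, String b) {
        Set<String> ga = bigrams(a), gb = bigrams(b);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return (2.0 * common.size()) / (ga.size() + gb.size());
    }

    public static void main(String[] args) {
        // 'Kennedy' vs. 'Keneddy': 2 * 5 common bigrams / (6 + 6) ≈ 0.83
        System.out.println(dice("Kennedy", "Keneddy"));
    }
}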

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold   0.17                       0.33                       0.5
Levenshtein          P 51.1  R 87.0  A 44.5     P 43.7  R 87.5  A 38.5     P 37.3  R 84.5  A 31.5
Overlap              P 37.9  R 87.0  A 33.0     P 33.0  R 87.0  A 33.0     P 36.4  R 86.5  A 31.5
Jaccard index        P 9.8   R 56.0  A 5.5      P 9.9   R 55.5  A 5.5      P 9.9   R 55.5  A 5.5
Dice Coefficient     P 9.8   R 56.0  A 5.5      P 9.9   R 56.0  A 5.5      P 9.9   R 55.5  A 5.5
Jaro                 P 11.0  R 63.5  A 7.0      P 11.9  R 63.0  A 7.5      P 9.8   R 56.0  A 5.5
Jaro-Winkler         P 11.0  R 63.5  A 7.0      P 11.9  R 63.0  A 7.5      P 9.8   R 56.0  A 5.5
Needleman-Wunch      P 43.7  R 87.0  A 38.0     P 43.7  R 87.0  A 38.0     P 16.9  R 59.0  A 10.0
LogSim               P 43.7  R 87.0  A 38.0     P 43.7  R 87.0  A 38.0     P 42.0  R 87.0  A 36.5
Soundex              P 35.9  R 83.5  A 30.0     P 35.9  R 83.5  A 30.0     P 28.7  R 82.0  A 23.5
Smith-Waterman       P 17.3  R 63.5  A 11.0     P 13.4  R 59.5  A 8.0      P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results, in %, for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and, therefore, will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes up the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it additionally gives an extra weight to strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.

The new distance is computed in the following way:

\[ distance = 1 - commonTermsScore - firstLetterMatchesScore \tag{20} \]

in which the scores for the common terms and for the first-letter matches of the remaining tokens are given by

\[ commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))} \tag{21} \]

and

\[ firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2} \tag{22} \]

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
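A minimal sketch of this metric, directly following Equations (20)-(22), is given below; the whitespace tokenization and the case-insensitive matching are assumptions about details not fully specified above.

import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public class JFKDistance {

    public static double distance(String a1, String a2) {
        List<String> t1 = new LinkedList<>(Arrays.asList(a1.toLowerCase().split("\\s+")));
        List<String> t2 = new LinkedList<>(Arrays.asList(a2.toLowerCase().split("\\s+")));
        int maxLength = Math.max(t1.size(), t2.size());

        // Count perfect token matches (common terms) and remove them from both token lists.
        int commonTerms = 0;
        for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
            if (t2.remove(it.next())) {
                it.remove();
                commonTerms++;
            }
        }

        // Among the remaining (non-common) tokens, count pairs sharing the same first
        // letter, e.g. 'j.'/'john' and 'f.'/'fitzgerald'.
        int nonCommonTerms = t1.size() + t2.size();
        int firstLetterMatches = 0;
        for (String token : t1) {
            for (Iterator<String> it = t2.iterator(); it.hasNext(); ) {
                if (it.next().charAt(0) == token.charAt(0)) {
                    it.remove();
                    firstLetterMatches++;
                    break;
                }
            }
        }

        double commonTermsScore = (double) commonTerms / maxLength;               // Equation (21)
        double firstLetterMatchesScore =                                          // Equation (22)
                nonCommonTerms == 0 ? 0.0 : (firstLetterMatches / (double) nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;                  // Equation (20)
    }

    public static void main(String[] args) {
        // 1 common term out of max(3, 3) tokens and 2 first-letter matches out of
        // 4 non-common tokens: 1 - 1/3 - (2/4)/2 ≈ 0.42
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}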

A few flaws of this measure were identified later on, namely regarding its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures according to the type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.

5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                   200 questions                 1210 questions
Before             P 51.1%   R 87.0%   A 44.5%   P 28.2%   R 79.8%   A 22.5%
After new metric   P 51.7%   R 87.0%   A 45.0%   P 29.2%   R 79.8%   A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts, which would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but that for the category Location LogSim is the metric which offers the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort, we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for that much), here are the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric / Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index        7.58     0.00       0.00
Dice                 7.58     0.00       0.00
Jaro                 9.09     0.00       0.00
Jaro-Winkler         9.09     0.00       0.00
Needleman-Wunch      42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman       9.09     2.56       0.00
JFKDistance          54.55    53.85      4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric for each category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected Google corpus (in this case, 2010).

– Numeric answers fail easily, due to the fact that when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might be to: a) ignore 'one' to 'three' when written in words; b) somehow exclude the publishing date from the snippet; and c) improve the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly: the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003, short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4 845-848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

Appendix

Appendix A – Rules for detecting relations between answers

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is the hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

1.2 Questions of category Numeric

Given:

• number = (\d+([0-9,.E])*)

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  Example: '200 people' is equivalent to '200'

• matchesic(Ai, "(.*)\bnumber( .*)") and matchesic(Aj, "(.*)\bnumber( .*)") and (numAimax > numAj) and (numAimin < numAj)
  Example: '210 people' is equivalent to '200 people'

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)")
  Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)") and (numAi = numAj)
  Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)") and (numAjmax > numAi) and (numAjmin < numAi)
  Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number( .*)") and (numAi < numAj)
  Example: 'more than 200' includes '300'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number( .*)") and (numAi > numAj)
  Example: 'less than 300' includes '200'
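As an illustration, the numeric equivalence and inclusion tests above could be encoded roughly as follows. This is a sketch using the max/min values defined earlier, not the actual JustAsk rule code:

    public class NumericRules {
        static final double MAX = 1.1;  // 'max' above
        static final double MIN = 0.9;  // 'min' above

        // equivalence when the two values are within roughly 10% of each other,
        // e.g. '210 people' is equivalent to '200 people'
        static boolean numericallyEquivalent(double numAi, double numAj) {
            return numAi * MAX > numAj && numAi * MIN < numAj;
        }

        // inclusion, e.g. 'more than 200' includes '300'
        static boolean moreThanIncludes(double numAi, double numAj) {
            return numAi < numAj;
        }

        public static void main(String[] args) {
            System.out.println(numericallyEquivalent(210, 200)); // true
            System.out.println(moreThanIncludes(200, 300));      // true
        }
    }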

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  Example: 'John F Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  Example: 'Mr Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  Example: 'John Fitzgerald Kennedy' is equivalent to 'John F Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  Example: 'John F Kennedy' is equivalent to 'F Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  Example: 'Mr Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)
  Example:

• levenshteinDistance(Ai, Aj) < 1
  Example:

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  Example: 'Republic of China' is equivalent to 'China'

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "in\s+Aj,0")
  Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  Example: 'Florida' includes 'Miami, Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  Example: 'California' includes 'Los Angeles, California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  Example: 'animal' includes 'monkey'


specific node two nodes share in common as an ancestor. If one node is 'cat' and the other is 'dog', the LCS would be 'mammal'.

– Resnik – takes the LCS of two concepts and computes an estimation of how 'informative' the concept really is (its Information Content). A concept that occurs frequently will have low Information Content, while a concept that is quite rare will have a high Information Content.

– Lin – takes the Resnik measure and normalizes it by also using the Information Content of the two nodes received as input.

– Jcn – uses the same information as Lin, but combined in a different way.
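For reference, these three measures are usually defined in terms of the Information Content IC(c) = -log P(c) and of the least common subsumer lcs(c1, c2). The formulations below are the standard ones from the literature, added here only as a summary; they are not part of the original text:

    simResnik(c1, c2) = IC(lcs(c1, c2))

    simLin(c1, c2) = 2 * IC(lcs(c1, c2)) / (IC(c1) + IC(c2))

    distJcn(c1, c2) = IC(c1) + IC(c2) - 2 * IC(lcs(c1, c2))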

Kozareva, Vazquez and Montoyo [16] also recognize the importance of having semantic knowledge in Textual Entailment Recognition (RTE) methods, i.e., methods to decide if one text can be entailed in another and vice versa – in other words, paraphrase identification. Moreover, they present a new model in order to detect semantic relations between word pairs from different syntactic categories. Comparisons should not only be limited to two verbs, two nouns or two adjectives, but also be made between a verb and a noun ('assassinated' vs. 'assassination') or an adjective and a noun ('happy' vs. 'happiness'), for instance. They measure similarity using variations of features such as:

– Sentence Expansion – expanding nouns, verbs, adjectives and adverbs to their synonyms, antonyms and verbs that precede/entail the verbs in question. Word sense disambiguation is used to reduce the potential noise of such expansion.

– Latent Semantic Analysis (LSA) – by constructing a term-domain matrix instead of a term-document one, in which the domain is extracted from the WordNet domain resource.

– Cosine measure – by using Relevant Domains instead of word frequency.

Fernando and Stevenson [8] also considered that a purely lexical similarity approach, such as the one conducted by Zhang and Patrick, was quite limited, since it failed to detect words with the same meaning. They therefore proposed a semantic similarity approach based on similarity information derived from WordNet, using the concept of matrix similarity first developed by Stevenson and Greenwood in 2005 [41].

Each input sentence is mapped into a binary vector, with the elements equal to 1 if a word is present and 0 otherwise. For a pair of sentences S1 and S2 we therefore have two vectors ~a and ~b, and the similarity is calculated using the following formula:

    sim(~a, ~b) = (~a W ~b) / (|~a| |~b|)    (7)

with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea, Qiu or Zhang, and five with also a better precision rating than those three metrics.
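A minimal sketch of how the matrix similarity in formula (7) can be computed, assuming the word-pair similarities Wij have already been obtained from one of the WordNet metrics above (all names and values below are illustrative, not the authors' code):

    public class MatrixSimilarity {
        // sim(a, b) = (a W b) / (|a| |b|), with a and b binary word-occurrence vectors
        static double similarity(double[] a, double[] b, double[][] w) {
            double num = 0.0;
            for (int i = 0; i < a.length; i++)
                for (int j = 0; j < b.length; j++)
                    num += a[i] * w[i][j] * b[j];
            return num / (norm(a) * norm(b));
        }

        static double norm(double[] v) {
            double s = 0.0;
            for (double x : v) s += x * x;
            return Math.sqrt(s);
        }

        public static void main(String[] args) {
            // toy example: two 2-word sentences; W holds WordNet-based word-pair similarities
            double[] s1 = {1, 1};
            double[] s2 = {1, 1};
            double[][] w = {{1.0, 0.4}, {0.3, 1.0}};
            System.out.println(similarity(s1, s2, w));
        }
    }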

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

    Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2)    (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency, and the depth of a dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment, and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies than the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities, and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here is what we found to have the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrixes, and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.

3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID (5) that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage, the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

(5) http://www.inesc-id.pt

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of a question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask in this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get us closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalization, clustering and filtering the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will be transformed into 'D21 M3 Y1961'.
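A minimal sketch of this kind of date normalization (a hypothetical helper, not the actual JustAsk normalizer; it only covers the two surface forms mentioned above):

    import java.util.*;
    import java.util.regex.*;

    public class DateNormalizer {
        private static final List<String> MONTHS = Arrays.asList(
            "january", "february", "march", "april", "may", "june",
            "july", "august", "september", "october", "november", "december");

        // 'March 21, 1961' or '21 March 1961' -> 'D21 M3 Y1961'
        public static String normalize(String answer) {
            Matcher m1 = Pattern.compile("(?i)([a-z]+) (\\d{1,2}),? (\\d{4})").matcher(answer);
            Matcher m2 = Pattern.compile("(?i)(\\d{1,2}) ([a-z]+) (\\d{4})").matcher(answer);
            if (m1.matches()) return canonical(m1.group(2), m1.group(1), m1.group(3));
            if (m2.matches()) return canonical(m2.group(1), m2.group(2), m2.group(3));
            return answer; // leave non-dates untouched
        }

        private static String canonical(String day, String month, String year) {
            int m = MONTHS.indexOf(month.toLowerCase()) + 1;
            return "D" + Integer.parseInt(day) + " M" + m + " Y" + year;
        }

        public static void main(String[] args) {
            System.out.println(normalize("March 21, 1961"));  // D21 M3 Y1961
            System.out.println(normalize("21 March 1961"));   // D21 M3 Y1961
        }
    }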

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

    Overlap(X, Y) = 1 - |X ∩ Y| / min(|X|, |Y|)    (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in the other. As mentioned before, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
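The following sketch illustrates the overlap distance and the single-link merging step just described (the class and method names are illustrative, not the actual JustAsk implementation):

    import java.util.*;

    public class OverlapClustering {

        // Overlap(X, Y) = 1 - |X ∩ Y| / min(|X|, |Y|)
        static double overlap(String x, String y) {
            Set<String> a = new HashSet<>(Arrays.asList(x.toLowerCase().split("\\s+")));
            Set<String> b = new HashSet<>(Arrays.asList(y.toLowerCase().split("\\s+")));
            Set<String> common = new HashSet<>(a);
            common.retainAll(b);
            return 1.0 - (double) common.size() / Math.min(a.size(), b.size());
        }

        // single-link distance between two clusters: minimum pairwise distance
        static double singleLink(List<String> c1, List<String> c2) {
            double min = Double.MAX_VALUE;
            for (String x : c1)
                for (String y : c2)
                    min = Math.min(min, overlap(x, y));
            return min;
        }

        // greedy agglomerative clustering: merge closest clusters while under the threshold
        static List<List<String>> cluster(List<String> answers, double threshold) {
            List<List<String>> clusters = new ArrayList<>();
            for (String a : answers) clusters.add(new ArrayList<>(Collections.singletonList(a)));
            while (true) {
                int bi = -1, bj = -1;
                double best = Double.MAX_VALUE;
                for (int i = 0; i < clusters.size(); i++)
                    for (int j = i + 1; j < clusters.size(); j++) {
                        double d = singleLink(clusters.get(i), clusters.get(j));
                        if (d < best) { best = d; bi = i; bj = j; }
                    }
                if (bi < 0 || best >= threshold) break;   // halt when nothing is close enough
                clusters.get(bi).addAll(clusters.remove(bj));
            }
            return clusters;
        }

        public static void main(String[] args) {
            System.out.println(cluster(Arrays.asList("Mount Everest", "Everest", "K2"), 0.33));
        }
    }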

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

    F = 2 * (precision * recall) / (precision + recall)    (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus    (13)

Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

    MRR = (1/N) * sum_{i=1}^{N} (1 / rank_i)    (14)
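As a concrete illustration, the following sketch computes these measures for the counts reported later in Table 1 (the class and method names are ours, not JustAsk's):

    import java.util.Arrays;
    import java.util.List;

    public class EvalMeasures {

        // ranks: for each question, the rank of the first correct answer (0 if none was correct)
        static double meanReciprocalRank(List<Integer> ranks) {
            double sum = 0.0;
            for (int r : ranks)
                if (r > 0) sum += 1.0 / r;
            return sum / ranks.size();
        }

        static double precision(int correct, int answered)      { return (double) correct / answered; }
        static double recall(int answered, int totalQuestions)  { return (double) answered / totalQuestions; }
        static double accuracy(int correct, int totalQuestions) { return (double) correct / totalQuestions; }
        static double f1(double p, double r)                    { return 2 * p * r / (p + r); }

        public static void main(String[] args) {
            // numbers from Table 1: 200 questions, 175 answered, 88 correct
            double p = precision(88, 175), r = recall(175, 200);
            System.out.printf("P=%.3f R=%.3f F1=%.3f A=%.3f%n", p, r, f1(p, r), accuracy(88, 200));
            System.out.println(meanReciprocalRank(Arrays.asList(1, 0, 2))); // 0.5
        }
    }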

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input, and then we see if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references to the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of the references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially being updated manually from the first 64 passages that we found over at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64x200 entries), and to review it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:

1. Question=What is the capital of Somalia?

URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing the corpus with a stipulated maximum of 64 snippets per question. The script runs the list of questions one by one, and for each one of them presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...', if we insert 'Mogadishu' the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one to find and write down correct answers, instead of the completely manual approach we used before writing the application. In this current version, the script prints the status of the reference list to an output file as it is being filled, question by question, by the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.
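A rough sketch of the core loop of this annotation helper (hypothetical and heavily simplified; the real tool also parses the question, reference and corpus file formats):

    import java.util.*;

    public class ReferenceCollector {
        public static void main(String[] args) {
            // toy data standing in for the parsed corpus file (question -> snippets)
            Map<String, List<String>> snippets = new LinkedHashMap<>();
            snippets.put("What is the capital of Somalia?",
                Arrays.asList("Mogadishu is the largest city in Somalia and the nation's capital..."));

            Map<String, List<String>> references = new LinkedHashMap<>();
            Scanner in = new Scanner(System.in);

            for (Map.Entry<String, List<String>> e : snippets.entrySet()) {
                List<String> refs = new ArrayList<>();
                for (String content : e.getValue()) {
                    System.out.println("QUESTION: " + e.getKey());
                    System.out.println("CONTENT: " + content);
                    System.out.print("Reference (or 'skip'): ");
                    String answer = in.nextLine().trim();
                    // only accept references that actually occur in the snippet
                    if (!answer.equals("skip") && content.toLowerCase().contains(answer.toLowerCase()))
                        refs.add(answer);
                }
                references.put(e.getKey(), refs);
            }
            System.out.println(references); // the real tool writes this to an output file
        }
    }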

3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered to extract the questions we knew the New York Times had a correct answer for. 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources according to each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to filter only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore losing precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference. We simply do not know, for now, who 'he' or 'she' is. It makes us question: "How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?"

The first and most simple solution was to filter the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?

In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance Festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and therefore was composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution would be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and by noticing a few mistakes in the application to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced it to 1210 questions.

New York Times corpus

The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, in several variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.

Questions: 200    Correct: 88    Wrong: 87    No Answer: 25

Accuracy: 44.0%    Precision: 50.3%    Recall: 87.5%    MRR: 0.37    Precision@1: 35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions with a certain amount of extracted answers.

Extracted        Google – 8 passages                        Google – 32 passages
Answers          Questions (#)  CAE success  AS success     Questions (#)  CAE success  AS success
0                21             0            –              19             0            –
1 to 10          73             62           53             16             11           11
11 to 20         32             29           22             13             9            7
21 to 40         11             10           8              61             57           39
41 to 60         1              1            1              30             25           17
61 to 100        0              –            –              18             14           11
> 100            0              –            –              5              4            3
All              138            102          84             162            120          88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

Extracted        Google – 64 passages
Answers          Questions (#)  CAE success  AS success
0                17             0            0
1 to 10          18             9            9
11 to 20         2              2            2
21 to 40         16             12           8
41 to 60         42             38           25
61 to 100        38             32           18
> 100            35             30           17
All              168            123          79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

            Google – 8 passages     Google – 32 passages    Google – 64 passages
     Total  Q     CAE    AS         Q     CAE    AS          Q     CAE    AS
abb  4      3     1      1          3     1      1           4     1      1
des  9      6     4      4          6     4      4           6     4      4
ent  21     15    1      1          16    1      0           17    1      0
hum  64     46    42     34         55    48     36          56    48     31
loc  39     27    24     19         34    27     22          35    29     21
num  63     41    30     25         48    35     25          50    40     22
     200    138   102    84         162   120    88          168   123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three question sets that were created using TREC questions and the New York Times corpus.

Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

            20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though the 100 passages offer the best results, they also take more than three times as long as running just 20, with a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section was designed to report some of the most common mistakes that JustAsk still presents over its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals the following:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls - Okavango River - Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr Clinton', normalizing abbreviations such as 'Mr', 'Mrs', 'Queen', etc. Then, by applying a new metric that performs similarity on a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J F Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case, we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.

3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – the same entity can appear under completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' and 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to also cover acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we solved all of these points.

4 Relating Answers

Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also hyponymy and antonymy, that we can access through WordNet. Moreover, we may want to

analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have this process automated and we have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to – and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.

Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.

4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
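A minimal sketch of what such a conversion normalizer might look like (a hypothetical class, using the usual conversion factors; not the actual JustAsk code):

    public class UnitNormalizer {
        // convert a numeric answer to a canonical unit, using standard conversion factors
        public static String normalize(double value, String unit) {
            switch (unit.toLowerCase()) {
                case "km": return (value * 0.621371) + " miles";     // Numeric:Distance
                case "m":  return (value * 3.28084) + " feet";       // Numeric:Size
                case "kg": return (value * 2.20462) + " pounds";     // Numeric:Weight
                case "c":  return (value * 9.0 / 5.0 + 32) + " F";   // Numeric:Temperature
                default:   return value + " " + unit;                // already canonical / unknown
            }
        }

        public static void main(String[] args) {
            System.out.println(normalize(300, "km")); // ~186.4 miles
            System.out.println(normalize(100, "c"));  // 212.0 F
        }
    }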

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' – in this case, however, we can easily limit 'King' removals only to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, allowing room for further extensions.

4.2.2 Proximity Measure

As for the problem described in point 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
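A sketch of this proximity heuristic (the 20-character threshold and the 5-point boost come from the text above; the class and method names are illustrative):

    import java.util.List;

    public class ProximityScorer {
        static final int DISTANCE_THRESHOLD = 20; // characters
        static final double BOOST = 5.0;          // score points

        // average character distance between the answer and each query keyword in the passage
        static double averageDistance(String passage, String answer, List<String> keywords) {
            int answerPos = passage.indexOf(answer);
            if (answerPos < 0) return Double.MAX_VALUE;
            double sum = 0;
            int found = 0;
            for (String kw : keywords) {
                int kwPos = passage.indexOf(kw);
                if (kwPos >= 0) { sum += Math.abs(answerPos - kwPos); found++; }
            }
            return found == 0 ? Double.MAX_VALUE : sum / found;
        }

        static double score(double baseScore, String passage, String answer, List<String> keywords) {
            return averageDistance(passage, answer, keywords) <= DISTANCE_THRESHOLD
                    ? baseScore + BOOST : baseScore;
        }
    }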

4.3 Results

                   200 questions                        1210 questions
Before features    P: 50.6%  R: 87.0%  A: 44.0%         P: 28.0%  R: 79.5%  A: 22.2%
After features     P: 51.1%  R: 87.0%  A: 44.5%         P: 28.2%  R: 79.8%  A: 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references we had gathered for the previous corpus and, for each pair of references, assigned one of the possible four types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in that case a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.

If the command given by the input fails to match any of these options, the script asks the user to repeat the judgement. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.

Over the next chapter we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.

5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):

    J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of characters in common by the sum of the total number of characters of the two sequences.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

    Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The set of bigrams for each word is {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly four elements: {Ke, en, ne, dy}. The result would be (2x4)/(6+6) = 8/12 = 2/3. If we were using Jaccard, we would have 4/12 = 1/3 (a small code sketch of this bigram computation follows the list of metrics below).

– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

    Jaro(s1, s2) = 1/3 * (m/|s1| + m/|s2| + (m - t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

    dw = dj + (l * p * (1 - dj))    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix, and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 for Levenshtein):

    D(i, j) = min { D(i-1, j-1) + G, D(i-1, j) + G, D(i, j-1) + G }    (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximizes the similarity measure, leading us to believe that it might work better with answers that have common affixes, such as 'acting' and 'enact'.

45

Then the LogSim metric by Cordeiro and his team described over theRelated Work in Section 322 was implemented and added to the packageusing the same weighting penalty of 27 as in the teamrsquos experiments We thentested all the similarity measures using three different thresholds values 12(05) 13(033) and 16(017) of the single-link clustering algorithm in orderto evaluate how much does it affect the performance of each metric

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.

Metric / Threshold   0.17                    0.33                    0.5
Levenshtein          P 51.1 R 87.0 A 44.5    P 43.7 R 87.5 A 38.5    P 37.3 R 84.5 A 31.5
Overlap              P 37.9 R 87.0 A 33.0    P 33.0 R 87.0 A 33.0    P 36.4 R 86.5 A 31.5
Jaccard index        P  9.8 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5    P  9.9 R 55.5 A  5.5
Dice Coefficient     P  9.8 R 56.0 A  5.5    P  9.9 R 56.0 A  5.5    P  9.9 R 55.5 A  5.5
Jaro                 P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Jaro-Winkler         P 11.0 R 63.5 A  7.0    P 11.9 R 63.0 A  7.5    P  9.8 R 56.0 A  5.5
Needleman-Wunsch     P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 16.9 R 59.0 A 10.0
LogSim               P 43.7 R 87.0 A 38.0    P 43.7 R 87.0 A 38.0    P 42.0 R 87.0 A 36.5
Soundex              P 35.9 R 83.5 A 30.0    P 35.9 R 83.5 A 30.0    P 28.7 R 82.0 A 23.5
Smith-Waterman       P 17.3 R 63.5 A 11.0    P 13.4 R 59.5 A  8.0    P 10.4 R 57.5 A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (in %) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are obtained with Levenshtein at the 0.17 threshold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic notion of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect token match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens left unmatched in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2)) (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2 (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
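A minimal Java sketch of this metric follows. It is our own illustrative reconstruction from formulas (20)-(22), not the exact JustAsk source; the class name and the details of the equal-initial heuristic are assumptions.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the JFKDistance idea: perfect token matches plus a
// half-weighted bonus for unmatched tokens that share their first letter.
public class JFKDistance {

    public static double distance(String a1, String a2) {
        List<String> t1 = new ArrayList<>(Arrays.asList(a1.toLowerCase().split("\\s+")));
        List<String> t2 = new ArrayList<>(Arrays.asList(a2.toLowerCase().split("\\s+")));
        int maxLength = Math.max(t1.size(), t2.size());

        // Count and remove perfect token matches (the "common terms").
        int commonTerms = 0;
        for (String token : new ArrayList<>(t1)) {
            if (t2.remove(token)) {
                t1.remove(token);
                commonTerms++;
            }
        }

        // Among the remaining (non-common) tokens, count pairs sharing the first letter.
        int nonCommonTerms = t1.size() + t2.size();
        int firstLetterMatches = 0;
        for (String token : t1) {
            for (int i = 0; i < t2.size(); i++) {
                if (!token.isEmpty() && !t2.get(i).isEmpty()
                        && token.charAt(0) == t2.get(i).charAt(0)) {
                    t2.remove(i);
                    firstLetterMatches++;
                    break;
                }
            }
        }

        double commonTermsScore = (double) commonTerms / maxLength;                // eq. (21)
        double firstLetterMatchesScore = nonCommonTerms == 0
                ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;      // eq. (22)
        return 1.0 - commonTermsScore - firstLetterMatchesScore;                   // eq. (20)
    }

    public static void main(String[] args) {
        // 'Kennedy' matches perfectly; 'J.'/'John' and 'F.'/'Fitzgerald' share initials.
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}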

A few flaws of this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.

5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                    200 questions           1210 questions
Before              P 51.1 R 87.0 A 44.5    P 28.2 R 79.8 A 22.5
After new metric    P 51.7 R 87.0 A 45.0    P 29.2 R 79.8 A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results (in %) after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)
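In the system's Java code, such a per-category dispatch could be expressed with a simple selector along these lines (a sketch with hypothetical class and method names, not JustAsk's actual interface):

// Hypothetical selector: picks a clustering distance metric per question category.
public class MetricSelector {

    // Distance functions are assumed to expose a common interface.
    interface DistanceMetric {
        double distance(String a, String b);
    }

    static DistanceMetric forCategory(String category,
                                      DistanceMetric levenshtein,
                                      DistanceMetric logSim,
                                      DistanceMetric fallback) {
        switch (category) {
            case "Human":
                return levenshtein;
            case "Location":
                return logSim;
            default:
                return fallback;
        }
    }
}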

For that effort, we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there are a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for too much), the results for every metric, broken down by the three main categories that can be easily evaluated – Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input) – are shown in Table 10.

Metric / Category    Human    Location    Numeric
Levenshtein          50.00    53.85        4.84
Overlap              42.42    41.03        4.84
Jaccard index         7.58     0.00        0.00
Dice                  7.58     0.00        0.00
Jaro                  9.09     0.00        0.00
Jaro-Winkler          9.09     0.00        0.00
Needleman-Wunsch     42.42    46.15        4.84
LogSim               42.42    46.15        4.84
Soundex              42.42    46.15        3.23
Smith-Waterman        9.09     2.56        0.00
JFKDistance          54.55    53.85        4.84

Table 10: Accuracy results (in %) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. The best result per category is obtained by the new distance, except where all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across the categories, i.e., results in each category are always 'proportional' to how they perform overall. That stops us from picking the best metric for each category, since the new distance implemented equals or surpasses all the others (as in the case of the Human category).

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement.

– Time-changing answers – some questions do not have a single, immutable answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve a) ignoring 'one' to 'three' when written as words, b) somehow excluding the publishing date from the snippet, and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords; and, in the case of films and other works, perhaps trace only candidates appearing between quotation marks, when possible.

– Incomplete references – the most easily correctable error lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision, by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias, and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989


[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006

[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.


[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence

[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006

[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995

[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006

[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970

[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro. Final report in question-answer systems. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2010.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)], 1918.

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur, and A. Mendes. Just ask – A multi-pronged approach to question answering, 2010. Submitted to Artificial Intelligence Tools.


[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005


Appendix

Appendix A – Rules for detecting relations between answers

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the j-th token of candidate answer Ai

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is a hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: 'M01 Y2001' includes 'D01 M01 Y2001'

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: 'Y2001' includes 'D01 M01 Y2001'

1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: '200 people' is equivalent to '200'

• matchesic(Ai, "()bnumber( )") and matchesic(Aj, "()bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
  – Example: '210 people' is equivalent to '200 people'

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )")
  – Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAi = numAj)
  – Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bmorethan number( )") and (numAi < numAj)
  – Example: 'more than 200' includes '300'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()blessthan number( )") and (numAi > numAj)
  – Example: 'less than 300' includes '200'

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: 'Mr. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: 'John Fitzgerald Kennedy' is equivalent to 'John F. Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F. Kennedy' is equivalent to 'F. Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: 'Mr. Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: 'Republic of China' is equivalent to 'China'

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0")
  – Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: 'Florida' includes 'Miami, Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: 'California' includes 'Los Angeles, California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  – Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: 'animal' includes 'monkey'


with W being a semantic similarity matrix containing the information about the similarity level of a word pair – in other words, a number between 0 and 1 for each Wij, representing the similarity level of two words, one from each sentence. Six different WordNet metrics were used (the already mentioned Lch, Lesk, Wup, Resnik, Lin and Jcn), five of them with an accuracy score higher than either Mihalcea, Qiu or Zhang, and five with also a better precision rating than those three metrics.

More recently, Lintean and Rus [23] identify paraphrases of sentences using weighted dependencies and word semantics, taking into account both the similarity and the dissimilarity between two sentences (S1 and S2), and then using them to compute a final paraphrase score as the ratio of the similarity and dissimilarity scores:

Paraphrase(S1, S2) = Sim(S1, S2) / Diss(S1, S2) (8)

If this score is above a certain threshold T, the sentences are deemed paraphrases; otherwise, they are not considered paraphrases. The threshold is obtained by optimizing the performance of the approach on training data. They used word-to-word similarity metrics based on statistical information from WordNet, which can potentially identify semantically related words even if they are not exactly synonyms (they give the example of the words 'say' and 'insist'). These similarity metrics are obtained through weighted dependency (syntactic) relationships [10] built between two words in a sentence: a head and a modifier. The weighting is done considering two factors: the type/label of the dependency and the depth of the dependency in the dependency tree. Minipar [21] and the Stanford parser [4] were used to extract dependency information. To compute the scores, they first map the input sentences into sets of dependencies, detect the common and non-common dependencies between the two, and then calculate the Sim(S1, S2) and Diss(S1, S2) parameters. If two sentences refer to the same action, but one is substantially longer than the other (i.e., it adds much more information), then there is no two-way entailment and therefore the two sentences are non-paraphrases. Lintean and Rus decided to add a constant value (the value 14) when the set of unpaired dependencies of the longer sentence has more than 6 extra dependencies compared to the set of unpaired dependencies of the shorter one. The duo managed to improve Qiu's approach by using word semantics to compute similarities and by using dependency types and the position in the dependency tree to weight dependencies, instead of simply using unlabeled and unweighted relations.

2.3 Overview

Thanks to the work initiated by Webber et al., we can now define four different types of relations between candidate answers: equivalence, inclusion, contradiction and alternative. Our focus will be on detecting relations of equivalence, expanding the two metrics already available in JustAsk (Levenshtein distance and overlap distance).

We saw many different types of techniques over the last pages, and many more have been developed over the last ten years, presenting variations of the ideas described here, such as building dependency trees or using semantic metrics dealing with notions such as synonymy, antonymy and hyponymy. For the purpose of synthesis, the work presented here represents what we found to have the most potential of being successfully tested on our system.

There are two basic lines of work here: the use of lexical-based features, such as the already implemented Levenshtein distance, and the use of semantic information (from WordNet). Attached to these, there were several interesting features, such as the transformation of units into canonical forms, the potential use of the context surrounding the answers we extract as candidates, dependency trees, similarity matrices, and the introduction of a dissimilarity score in the formula of paraphrase identification.

For all of these techniques we also saw different types of corpora for evaluation. The most prominent and most widely available is the MSR Paraphrase Corpus, which was criticized by Zhang and Patrick for being composed of too many lexically similar sentences. Another corpus used was the Knight and Marcu corpus (KMC), composed of 1087 sentence pairs. Since these two corpora lack negative pairs, i.e., pairs of non-paraphrases, Cordeiro et al. decided to create three new corpora using a mixture of these two with negative pairs, in order to balance the number of positive and negative pairs. Finally, the British National Corpus (BNC) was used for the semantic approach of Mihalcea et al.


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID (http://www.inesc-id.pt) that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture, with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work on the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question classification is also a subtask of this module. In JustAsk, question classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state-of-the-art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories with search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalization, clustering and filtering of the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (it is able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the variation among answers, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will both be transformed into 'D21 M3 Y1961'.
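As an illustration of this kind of date normalization, here is a simplified sketch of the idea; the class and the supported formats are our own assumptions, and the real JustAsk normalizers are more complete.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified date normalizer: maps 'March 21, 1961' and '21 March 1961'
// to the canonical form 'D21 M3 Y1961'.
public class DateNormalizer {

    private static final Map<String, Integer> MONTHS = new HashMap<>();
    static {
        String[] names = {"january", "february", "march", "april", "may", "june",
                "july", "august", "september", "october", "november", "december"};
        for (int i = 0; i < names.length; i++) {
            MONTHS.put(names[i], i + 1);
        }
    }

    private static final Pattern MONTH_FIRST =
            Pattern.compile("(?i)([a-z]+)\\s+(\\d{1,2}),?\\s+(\\d{4})");
    private static final Pattern DAY_FIRST =
            Pattern.compile("(?i)(\\d{1,2})\\s+([a-z]+)\\s+(\\d{4})");

    public static String normalize(String date) {
        Matcher m = MONTH_FIRST.matcher(date.trim());
        if (m.matches() && MONTHS.containsKey(m.group(1).toLowerCase())) {
            return "D" + Integer.parseInt(m.group(2))
                    + " M" + MONTHS.get(m.group(1).toLowerCase())
                    + " Y" + m.group(3);
        }
        m = DAY_FIRST.matcher(date.trim());
        if (m.matches() && MONTHS.containsKey(m.group(2).toLowerCase())) {
            return "D" + Integer.parseInt(m.group(1))
                    + " M" + MONTHS.get(m.group(2).toLowerCase())
                    + " Y" + m.group(3);
        }
        return date; // unrecognized format: leave unchanged
    }

    public static void main(String[] args) {
        System.out.println(normalize("March 21, 1961"));  // D21 M3 Y1961
        System.out.println(normalize("21 March 1961"));   // D21 M3 Y1961
    }
}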

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 - |X ∩ Y| / min(|X|, |Y|) (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in each other. As mentioned before, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially every candidate answer has a cluster of its own, and then, at each step, the two closest clusters are merged if their distance is under a certain threshold. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of its candidate answers. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its size.
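The following is a compact sketch of this single-link, threshold-based clustering loop (our own illustrative Java with a pluggable distance function, not the actual JustAsk implementation):

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Single-link agglomerative clustering of candidate answers: repeatedly merge
// the two closest clusters while their distance stays under the threshold.
public class SingleLinkClusterer {

    public static List<List<String>> cluster(List<String> answers,
                                             BiFunction<String, String, Double> distance,
                                             double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String a : answers) {             // start with one cluster per answer
            List<String> c = new ArrayList<>();
            c.add(a);
            clusters.add(c);
        }
        while (true) {
            int bestI = -1, bestJ = -1;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    // single-link: cluster distance = minimum pairwise distance
                    for (String a : clusters.get(i)) {
                        for (String b : clusters.get(j)) {
                            double d = distance.apply(a, b);
                            if (d < bestDist) {
                                bestDist = d;
                                bestI = i;
                                bestJ = j;
                            }
                        }
                    }
                }
            }
            if (bestI < 0 || bestDist > threshold) {
                break;                          // no pair left under the threshold
            }
            clusters.get(bestI).addAll(clusters.remove(bestJ));
        }
        return clusters;
    }
}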

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus (11)

F-measure (F1):

F = (2 × Precision × Recall) / (Precision + Recall) (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = (1/N) × Σ_{i=1..N} (1 / rank_i) (14)
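For instance (an illustrative example of ours, not taken from the thesis experiments): if a system answers three questions and the first correct answer appears at rank 1 for the first question, at rank 3 for the second, and never appears for the third (reciprocal rank 0), then MRR = (1/3) × (1/1 + 1/3 + 0) = 4/9 ≈ 0.44.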

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then seeing if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages that we found over each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1. Question=What is the capital of Somalia?

URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...', and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In its current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,


while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially helping to construct new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.


3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions, from which we filtered out and kept the questions we knew the New York Times had a correct answer for; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly, and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times retrieved a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswerable due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us wonder how much we can improve something that has no clear references, because the question itself references the previous one, and so forth.

The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
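A simple way to implement this filtering step is sketched below; the file names and the one-question-per-line format are our own assumptions.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Filters out questions containing unresolved pronouns (anaphoric references).
public class PronounFilter {

    private static final Pattern PRONOUN = Pattern.compile(
            "(?i)\\b(he|she|it|her|him|hers|his)\\b");

    public static void main(String[] args) throws IOException {
        List<String> questions = Files.readAllLines(Paths.get("trec-questions.txt"));
        List<String> kept = questions.stream()
                .filter(q -> !PRONOUN.matcher(q).find())
                .collect(Collectors.toList());
        Files.write(Paths.get("trec-questions-filtered.txt"), kept);
    }
}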

But there were still much less obvious cases present. Here is a blatant example of how we still had the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?

In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historical context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus carried topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and after noticing a few mistakes in the application that filters the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as in Google.

We also questioned whether the big problem lies in the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/sentence, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not whole documents but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
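The paragraph-level indexing can be sketched roughly as follows. This is our own illustrative example using a recent Lucene-style API; the Lucene version, index location and field layout actually used in JustAsk may differ.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Indexes each paragraph of an article as its own Lucene document, so that
// passage retrieval returns paragraphs instead of whole articles.
public class ParagraphIndexer {

    public static void indexArticle(IndexWriter writer, String article) throws Exception {
        for (String paragraph : article.split("\n\\s*\n")) {   // split on blank lines
            Document doc = new Document();
            doc.add(new TextField("text", paragraph, Field.Store.YES));
            writer.addDocument(doc);
        }
    }

    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("nyt-paragraph-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            indexArticle(writer, "First paragraph...\n\nSecond paragraph...");
        }
    }
}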

3.3 Baseline

3.3.1 Results with Multisix Corpus

The results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, of the 200 questions JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.

Questions    Correct    Wrong    No Answer
200          88         87       25

Accuracy    Precision    Recall    MRR     Precision@1
44%         50.3%        87.5%     0.37    35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.

                     Google – 8 passages           Google – 32 passages
Extracted answers    Questions  CAE    AS          Questions  CAE    AS
0                    21         0      –           19         0      –
1 to 10              73         62     53          16         11     11
11 to 20             32         29     22          13         9      7
21 to 40             11         10     8           61         57     39
41 to 60             1          1      1           30         25     17
61 to 100            0          –      –           18         14     11
> 100                0          –      –           5          4      3
All                  138        102    84          162        120    88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity the system only extracts one final answer no


                     Google – 64 passages
Extracted answers    Questions  CAE    AS
0                    17         0      0
1 to 10              18         9      9
11 to 20             2          2      2
21 to 40             16         12     8
41 to 60             42         38     25
61 to 100            38         32     18
> 100                35         30     17
All                  168        123    79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.

            Google – 8 passages    Google – 32 passages    Google – 64 passages
     Total  Q    CAE   AS          Q    CAE   AS            Q    CAE   AS
abb  4      3    1     1           3    1     1             4    1     1
des  9      6    4     4           6    4     4             6    4     4
ent  21     15   1     1           16   1     0             17   1     0
hum  64     46   42    34          55   48    36            56   48    31
loc  39     27   24    19          34   27    22            35   29    21
num  63     41   30    25          48   35    25            50   40    22
     200    138  102   84          162  120   88            168  123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph-indexing method proved to be much better overall than document indexing and was therefore adopted permanently while searching for even better results.

The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

             1134 questions    232 questions    1210 questions
Precision    26.7%             28.2%            27.4%
Recall       77.2%             67.2%            76.9%
Accuracy     20.6%             19.0%            21.1%

Table 5: Baseline results for the three sets of questions that were created using TREC questions and the New York Times corpus.

Later, we made further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

             20       30       50       100
Accuracy     21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.

34 Error Analysis over the Answer Module

This section was designed to report some of the most common mistakes thatJustAsk still presents over its Answer Module ndash mainly while clustering theanswer candidates failing to cluster equal or inclusive answers into the samecluster but also at the answer extraction phase These mistakes are directlylinked to the distance algorithms we have at the moment (Levenshtein andOverlap metrics)

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'


– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.


3. Places/dates/numbers that differ in specificity – some candidates of the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' being included in Italy (and a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same thing: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' and 'GROZNY'

Solution: the solution should simply be to normalize all the answers into a standard format. As for the misspelling issue, a similarity measure such


as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to include this kind of abbreviation.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships, valuing relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to


analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have made this process automatic and have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] has taught us that using relations between answers can improve our results considerably. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use it for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

Each answer now possesses a set of relations it shares with other answers – namely the type of relation, the answer it is related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
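To make the scoring concrete, the sketch below shows how such a boost could be applied once each answer carries its set of relations. It is a simplified outline with hypothetical Answer and Relation classes (none of the names correspond to the actual JustAsk code); the weights follow the first experiment above: 1.0 per relation of equivalence and 0.5 per relation of inclusion, on top of the frequency-based score computed by the baseline scorer.

import java.util.List;

public class RulesScorer {

    enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    static class Relation {
        RelationType type;
        Relation(RelationType type) { this.type = type; }
    }

    static class Answer {
        String text;
        double score;              // starts as the frequency-based score from the baseline scorer
        List<Relation> relations;  // relations this answer shares with other candidates
        Answer(String text, double score, List<Relation> relations) {
            this.text = text; this.score = score; this.relations = relations;
        }
    }

    // Boosts each answer's score according to the relations it shares with other answers.
    public static void boost(List<Answer> answers) {
        for (Answer a : answers) {
            for (Relation r : a.relations) {
                if (r.type == RelationType.EQUIVALENCE) {
                    a.score += 1.0;   // equivalence weighted as a full match
                } else {
                    a.score += 0.5;   // either direction of inclusion gets half the weight
                }
            }
        }
    }
}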


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
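As an illustration, a converter of this kind could look like the following sketch. The class name, the recognized unit strings and the canonical target units are assumptions made for the example, not the actual JustAsk normalizer.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MeasureNormalizer {

    private static final Pattern MEASURE = Pattern.compile(
            "([0-9]+(?:\\.[0-9]+)?)\\s*(km|kilometers?|meters?|kg|kilograms?|celsius)",
            Pattern.CASE_INSENSITIVE);

    // Converts an answer such as "300 km" or "37 celsius" into a canonical unit
    // (miles, feet, pounds, Fahrenheit), so that equivalent measures can fall into the same cluster.
    public static String normalize(String answer) {
        Matcher m = MEASURE.matcher(answer);
        if (!m.find()) {
            return answer;   // nothing to convert
        }
        double value = Double.parseDouble(m.group(1));
        String unit = m.group(2).toLowerCase();
        if (unit.startsWith("km") || unit.startsWith("kilometer")) {
            return String.format("%.1f miles", value * 0.621371);
        } else if (unit.startsWith("meter")) {
            return String.format("%.1f feet", value * 3.28084);
        } else if (unit.startsWith("kg") || unit.startsWith("kilogram")) {
            return String.format("%.1f pounds", value * 2.20462);
        } else {   // celsius
            return String.format("%.1f Fahrenheit", value * 9.0 / 5.0 + 32.0);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("300 km"));   // 186.4 miles
        System.out.println(normalize("100 kg"));   // 220.5 pounds
    }
}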

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious observation that we must not, under any circumstances, underestimate the richness of a natural language.
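A sketch of this abbreviation-stripping step, including an exception list of the kind discussed above, could look as follows (class name, title list ordering and exceptions are illustrative):

import java.util.*;

public class AbbreviationNormalizer {

    private static final List<String> TITLES =
            Arrays.asList("Mr", "Mrs", "Ms", "King", "Queen", "Dame", "Sir", "Dr");

    // Answers that must be kept untouched even though they start with (or equal) a title word.
    private static final Set<String> EXCEPTIONS = new HashSet<>(Arrays.asList("Mr T", "Mr. T", "Queen"));

    public static String normalize(String answer) {
        if (EXCEPTIONS.contains(answer)) {
            return answer;
        }
        // Only strip the title when it occurs at the beginning of the answer (e.g. 'Mr King' -> 'King').
        for (String title : TITLES) {
            if (answer.startsWith(title + " ") || answer.startsWith(title + ". ")) {
                return answer.substring(answer.indexOf(' ') + 1);
            }
        }
        return answer;
    }
}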

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, leaving room for further extensions.

4.2.2 Proximity Measure

As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing in the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as 20 characters long), we boost that answer's score (by 5 points).
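A sketch of this proximity boost is given below; field and method names are illustrative, while the 20-character threshold and the 5-point boost follow the values above.

public class ProximityScorer {

    private static final int MAX_AVG_DISTANCE = 20;   // characters
    private static final double BOOST = 5.0;          // points added to the answer's score

    // Returns the boost to add to a candidate answer, given the passage it was extracted from,
    // its character offset in that passage, and the query keywords.
    public static double proximityBoost(String passage, int answerOffset, String[] keywords) {
        String lowerPassage = passage.toLowerCase();
        double totalDistance = 0;
        int found = 0;
        for (String keyword : keywords) {
            int pos = lowerPassage.indexOf(keyword.toLowerCase());
            if (pos >= 0) {
                totalDistance += Math.abs(pos - answerOffset);
                found++;
            }
        }
        if (found == 0) {
            return 0.0;   // no keyword occurs in the passage: no boost
        }
        double averageDistance = totalDistance / found;
        return averageDistance <= MAX_AVG_DISTANCE ? BOOST : 0.0;
    }
}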


4.3 Results

                     200 questions              1210 questions
Before features      P 50.6  R 87.0  A 44.0     P 28.0  R 79.5  A 22.2
After features       P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results (in %) before and after the implementation of the features described in Sections 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered for the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;


– 'D' for cases of doubt about the relation type.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it simply advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file as well. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.
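A compact sketch of such an annotation loop is shown below. Input parsing and file output are omitted, the option letters and the output line format follow the description above, and all class and method names are illustrative rather than the actual application code.

import java.util.*;

public class RelationAnnotator {

    private static final Set<String> OPTIONS =
            new HashSet<>(Arrays.asList("E", "I1", "I2", "A", "C", "D", "skip"));

    // Asks the user to judge the relation between every pair of references of a question and
    // prints one line per judged pair, e.g. "QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee".
    public static void annotate(int questionId, List<String> references, Scanner in) {
        for (int i = 0; i < references.size(); i++) {
            for (int j = i + 1; j < references.size(); j++) {
                String judgment;
                do {
                    System.out.printf("Relation between '%s' and '%s' (E/I1/I2/A/C/D/skip)? ",
                                      references.get(i), references.get(j));
                    judgment = in.nextLine().trim();
                } while (!OPTIONS.contains(judgment));   // repeat until a valid option is given
                if (!judgment.equals("skip")) {
                    System.out.printf("QUESTION %d %s - %s AND %s%n",
                                      questionId, label(judgment), references.get(i), references.get(j));
                }
            }
        }
    }

    private static String label(String option) {
        switch (option) {
            case "E":  return "EQUIVALENCE";
            case "I1": case "I2": return "INCLUSION";
            case "A":  return "COMPLEMENTARY ANSWERS";
            case "C":  return "CONTRADICTORY ANSWERS";
            default:   return "IN DOUBT";
        }
    }
}

Note that a question with a single reference produces no pairs, so the loop simply moves on, as described above.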

In the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with the potential to improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics'.

– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements the two sequences share by the total number of distinct elements occurring in either of them.

– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams of the two words are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets has five elements: {Ke, en, ne, ed, dy}. The result would be (2 · 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (a short sketch reproducing this computation is included after this list of metrics).


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 · (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be compared, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of Levenshtein's constant cost of 1):

D(i, j) = min{ D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }   (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequences in their entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
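Following up on the Jaccard and Dice definitions above, the small self-contained sketch below computes both coefficients over character bigrams and reproduces the 'Kennedy'/'Keneddy' example; it is an illustration only, not the SimMetrics implementation used in JustAsk.

import java.util.HashSet;
import java.util.Set;

public class BigramCoefficients {

    private static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> a = bigrams("Kennedy");   // {Ke, en, nn, ne, ed, dy}
        Set<String> b = bigrams("Keneddy");   // {Ke, en, ne, ed, dd, dy}

        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);            // {Ke, en, ne, ed, dy}
        Set<String> union = new HashSet<>(a);
        union.addAll(b);

        double dice = 2.0 * intersection.size() / (a.size() + b.size());
        double jaccard = (double) intersection.size() / union.size();
        System.out.println(dice);      // 0.8333...
        System.out.println(jaccard);   // 0.7142...
    }
}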


Then, the LogSim metric by Cordeiro and his team, described in the related work (Section 2.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in that team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
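For reference, the clustering step that these thresholds control can be sketched as follows: a simplified single-link implementation with a pluggable metric, where the interface and class names are illustrative and not the actual JustAsk code.

import java.util.*;

public class SingleLinkClusterer {

    public interface Metric {
        double distance(String a, String b);   // e.g. Levenshtein, Overlap, LogSim...
    }

    // Starts with one cluster per candidate answer and repeatedly merges the two closest
    // clusters while their single-link distance is below the threshold.
    public static List<List<String>> cluster(List<String> answers, Metric metric, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String a : answers) {
            clusters.add(new ArrayList<>(Collections.singletonList(a)));
        }
        while (true) {
            int bestI = -1, bestJ = -1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    // Single link: the distance between two clusters is the minimum pairwise distance.
                    for (String x : clusters.get(i)) {
                        for (String y : clusters.get(j)) {
                            double d = metric.distance(x, y);
                            if (d < best) { best = d; bestI = i; bestJ = j; }
                        }
                    }
                }
            }
            if (bestI < 0 || best > threshold) {
                return clusters;   // no pair of clusters is close enough to merge
            }
            clusters.get(bestI).addAll(clusters.remove(bestJ));
        }
    }
}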

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric               Threshold 0.17        Threshold 0.33        Threshold 0.5
                     P     R     A         P     R     A         P     R     A
Levenshtein          51.1  87.0  44.5      43.7  87.5  38.5      37.3  84.5  31.5
Overlap              37.9  87.0  33.0      33.0  87.0  33.0      36.4  86.5  31.5
Jaccard index         9.8  56.0   5.5       9.9  55.5   5.5       9.9  55.5   5.5
Dice Coefficient      9.8  56.0   5.5       9.9  56.0   5.5       9.9  55.5   5.5
Jaro                 11.0  63.5   7.0      11.9  63.0   7.5       9.8  56.0   5.5
Jaro-Winkler         11.0  63.5   7.0      11.9  63.0   7.5       9.8  56.0   5.5
Needleman-Wunsch     43.7  87.0  38.0      43.7  87.0  38.0      16.9  59.0  10.0
LogSim               43.7  87.0  38.0      43.7  87.0  38.0      42.0  87.0  36.5
Soundex              35.9  83.5  30.0      35.9  83.5  30.0      28.7  82.0  23.5
Smith-Waterman       17.3  63.5  11.0      13.4  59.5   8.0      10.4  57.5   6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (in %) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in


order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above: we have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
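A minimal sketch of this metric in Java is given below. It assumes whitespace tokenization and, following the example above, counts nonCommonTerms as the unmatched tokens of both answers; the class and method names are illustrative rather than the actual JustAsk implementation.

import java.util.*;

public class JFKDistance {

    // Distance in [0, 1]; lower means more similar. A sketch of Equations (20)-(22).
    public static double distance(String a1, String a2) {
        List<String> tokens1 = new LinkedList<>(Arrays.asList(a1.trim().toLowerCase().split("\\s+")));
        List<String> tokens2 = new LinkedList<>(Arrays.asList(a2.trim().toLowerCase().split("\\s+")));
        int maxLength = Math.max(tokens1.size(), tokens2.size());

        // Common terms: exact token matches, removed from both lists as they are paired up.
        int commonTerms = 0;
        for (Iterator<String> it = tokens1.iterator(); it.hasNext(); ) {
            if (tokens2.remove(it.next())) {
                it.remove();
                commonTerms++;
            }
        }

        // First-letter matches among the remaining (non-common) tokens.
        int nonCommonTerms = tokens1.size() + tokens2.size();
        int firstLetterMatches = 0;
        for (Iterator<String> it = tokens1.iterator(); it.hasNext(); ) {
            char first = it.next().charAt(0);
            for (Iterator<String> jt = tokens2.iterator(); jt.hasNext(); ) {
                if (jt.next().charAt(0) == first) {
                    it.remove();
                    jt.remove();
                    firstLetterMatches++;
                    break;
                }
            }
        }

        double commonTermsScore = (double) commonTerms / maxLength;
        double firstLetterMatchesScore =
                nonCommonTerms == 0 ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    public static void main(String[] args) {
        // 'J. F. Kennedy' vs 'John Fitzgerald Kennedy': one common term plus two first-letter matches.
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}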

A few flaws in this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                     200 questions              1210 questions
Before               P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5
After new metric     P 51.7  R 87.0  A 45.0     P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results (in %) after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers us the best results, we would have something like:

if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)
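A sketch of how this selection could be wired is shown below; the metric implementations are left as placeholders for the ones already available in the distances package, and all names here are illustrative.

import java.util.HashMap;
import java.util.Map;

public class MetricSelector {

    public interface Metric {
        double distance(String a, String b);
    }

    // Placeholders standing in for the implementations available in the distances package.
    private static final Metric LEVENSHTEIN = (a, b) -> 0.0;
    private static final Metric LOGSIM = (a, b) -> 0.0;

    private static final Map<String, Metric> METRIC_BY_CATEGORY = new HashMap<>();
    static {
        METRIC_BY_CATEGORY.put("Human", LEVENSHTEIN);
        METRIC_BY_CATEGORY.put("Location", LOGSIM);
        // ... one entry per category that benefits from a specific metric
    }

    // Falls back to Levenshtein for categories without a dedicated metric.
    public static Metric forCategory(String category) {
        return METRIC_BY_CATEGORY.getOrDefault(category, LEVENSHTEIN);
    }
}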

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect too much the analysis we are


looking for), here are the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, which retrieve the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric / Category     Human    Location    Numeric
Levenshtein           50.00      53.85       4.84
Overlap               42.42      41.03       4.84
Jaccard index          7.58       0.00       0.00
Dice                   7.58       0.00       0.00
Jaro                   9.09       0.00       0.00
Jaro-Winkler           9.09       0.00       0.00
Needleman-Wunsch      42.42      46.15       4.84
LogSim                42.42      46.15       4.84
Soundex               42.42      46.15       3.23
Smith-Waterman         9.09       2.56       0.00
JFKDistance           54.55      53.85       4.84

Table 10: Accuracy results (in %) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric for each category, since we find out that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement.

– Time-changing answers – some questions do not have a single, unchanging answer; the correct answer varies with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).

– Numeric answers are easily prone to error, due to the fact that when we expect a number, the system can pick up words such as 'one', actually retrieving that as


a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written in words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be to apply a set of rules that removes these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error detected is in the reference file, which lists all the possible 'correct' answers and is potentially missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art review on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure of semantic relatedness In Proc of the 18th Int'l Joint Conf on Artificial Intelligence pages 805–810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingual comparable corpora In Proceedings of the 2003 conference on Empirical methods in natural language processing pages 25–32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsupervised approach using multiple-sequence alignment In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL) pages 16–23 2003

[4] Marie-Catherine de Marneffe Bill MacCartney and Christopher D Manning Generating typed dependency parses from phrase structure parses In LREC 2006 2006

[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning of paraphrases In Research in Computer Science National Polytechnic Institute Mexico ISSN pages 1870–4069 2007

[6] L Dice Measures of the amount of ecologic association between species Ecology 26(3):297–302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised construction of large paraphrase corpora Exploiting massively parallel news sources In Proceedings of the 20th International Conference on Computational Linguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to paraphrase detection In CLUCK 2008 pages 18–26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detecting text similarity over short passages Exploring linguistic feature combinations via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Languages 40 pages 511–525 1964

[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines Bulletin de la Societe Vaudoise des Sciences Naturelles 37:241–272 1901

[12] M A Jaro Probabilistic linkage of large public health data file In Statistics in Medicine volume 14 pages 491–498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied to matching the 1985 census of Tampa Florida Journal of the American Statistical Association 84(406):414–420 1989


[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statistics and lexical taxonomy In Proc of the Int'l Conf on Research in Computational Linguistics pages 19–33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can improve statistical translation models In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology companion volume of the Proceedings of HLT-NAACL 2003 – short papers - Volume 2 NAACL-Short '03 pages 46–48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of semantic knowledge expansion to textual entailment recognition In TSD volume 4188 of Lecture Notes in Computer Science pages 143–150 Springer 2006

[17] T K Landauer and S T Dumais A solution to Plato's problem The latent semantic analysis theory of the acquisition induction and representation of knowledge Psychological Review 104:211–240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet similarity for word sense identification In MIT Press pages 265–283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting deletions insertions and reversals Soviet Physics Doklady 10(8):707–710 1966 Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings of the 19th international conference on Computational linguistics pages 1–7 Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages 112–120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighted dependencies and word semantics In H Chad Lane and Hans W Guesgen editors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini Simone Romagnoli Alessandro Vallin Jesus Herrera Anselmo Peñas Víctor Peinado Felisa Verdejo and Maarten de Rijke The multiple language question answering track at clef 2003 In CLEF volume 3237 of Lecture Notes in Computer Science pages 471–486 Springer 2003

[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor Erbach Anselmo Peñas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov and Richard F E Sutcliffe Overview of the clef 2004 multilingual question answering track In CLEF volume 3491 of Lecture Notes in Computer Science pages 371–391 Springer 2004


[26] A Mendes and L Coheur An approach to answer selection in question answering based on semantic relations 2011 Accepted for publication in IJCAI 2011 the 22nd International Joint Conference on Artificial Intelligence

[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-based and knowledge-based measures of text semantic similarity In AAAI'06 pages 775–780 2006

[28] George A Miller Wordnet A lexical database for english Communications of the ACM 38:39–41 1995

[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006

[30] Saul B Needleman and Christian D Wunsch A general method applicable to the search for similarities in the amino acid sequence of two proteins Journal of Molecular Biology 48(3):443–453 1970

[31] Franz Josef Och and Hermann Ney A systematic comparison of various statistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu a method for automatic evaluation of machine translation In Computational Linguistics (ACL) Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL) Philadelphia pages 311–318 2002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition via dissimilarity significance classification In EMNLP pages 18–26 ACL 2006

[34] Philip Resnik Using information content to evaluate semantic similarity in a taxonomy In Proc of the 14th Int'l Joint Conf on Artificial Intelligence (AAAI) pages 448–453 1995

[35] Joana Ribeiro Final report in question-answer systems 2010 Universidade Tecnica de Lisboa – Instituto Superior Tecnico Portugal

[36] R C Russell Soundex phonetic comparison system [cf US patents 1261167 (1918) 1435663 (1922)] 1918

[37] Gerard Salton and Christopher Buckley Term-weighting approaches in automatic text retrieval In INFORMATION PROCESSING AND MANAGEMENT pages 513–523 1988

[38] J Silva QA+ML@Wikipedia&Google Master's thesis Universidade Tecnica de Lisboa – Instituto Superior Tecnico Portugal 2009

[39] J Silva L Coheur and A Mendes Just Ask – A multi-pronged approach to question answering 2010 Submitted to Artificial Intelligence Tools


[40] T Smith and M Waterman Identification of common molecular subsequences Journal of Molecular Biology 147(1):195–197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IE pattern induction In ACL The Association for Computer Linguistics 2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA on TOEFL In EMCL '01 Proceedings of the 12th European Conference on Machine Learning pages 491–502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1–13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Inference in question answering In Third international conference on Language Resources and Evaluation - LREC'02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rules in the fellegi-sunter model of record linkage In Proceedings of the Section on Survey Research pages 354–359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selection In Proc of the 32nd annual meeting on Association for Computational Linguistics pages 133–138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonicalization In Proceedings of the Australasian Language Technology Workshop 2005 pages 160–166 Sydney Australia 2005


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is the hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 1)
  – Example: 'M01 Y2001' includes 'D01 M01 Y2001'

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 2)
  – Example: 'Y2001' includes 'D01 M01 Y2001'


1.2 Questions of category Numeric

Given

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, 'Aj( [a-z]+)')
  – Example: '200 people' is equivalent to '200'

• matchesic(Ai, '(.*)\bnumber( .*)?') and matchesic(Aj, '(.*)\bnumber( .*)?') and (numAimax > numAj) and (numAimin < numAj)
  – Example: '210 people' is equivalent to '200 people'

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, '(.*)\bapprox number( .*)?')
  – Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, '(.*)\bapprox number( .*)?') and (numAi = numAj)
  – Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, '(.*)\bapprox number( .*)?') and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: 'approximately 200 people' includes '210'


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, '(.*)\bmorethan number( .*)?') and (numAi < numAj)
  – Example: 'more than 200' includes '300'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, '(.*)\blessthan number( .*)?') and (numAi > numAj)
  – Example: 'less than 300' includes '200'
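As an illustration, the plain-number case of the equivalence rule above (two values within the max/min tolerance band) could be implemented as sketched below; this is a simplified, hypothetical rendering for illustration, not the actual module code.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericRules {

    private static final double MAX = 1.1;
    private static final double MIN = 0.9;
    private static final Pattern NUMBER = Pattern.compile("(\\d+(?:\\.\\d+)?)");

    // Two numeric answers are considered equivalent when their values lie within roughly
    // 10% of each other, e.g. '210 people' and '200 people'.
    public static boolean equivalent(String ai, String aj) {
        Matcher mi = NUMBER.matcher(ai);
        Matcher mj = NUMBER.matcher(aj);
        if (!mi.find() || !mj.find()) {
            return false;
        }
        double numAi = Double.parseDouble(mi.group(1));
        double numAj = Double.parseDouble(mj.group(1));
        return numAi * MAX > numAj && numAi * MIN < numAj;
    }

    public static void main(String[] args) {
        System.out.println(equivalent("210 people", "200 people"));   // true
        System.out.println(equivalent("300 km", "200 km"));           // false
    }
}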

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, '(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)')
  – Example: 'Mr Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: 'John Fitzgerald Kennedy' is equivalent to 'John F Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F Kennedy' is equivalent to 'F Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: 'Mr Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, '(republic|city|kingdom|island)') and matches(Ai,1, '(of|in)')
  – Example: 'Republic of China' is equivalent to 'China'


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, '(of|in)')
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, 'in\s+Aj,0')
  – Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, '(republic|city|kingdom|island)\s+(in|of)\s+Aj,0')
  – Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, '(in|of)\s+Aj,0')
  – Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ',')
  – Example: 'Florida' includes 'Miami, Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ',')
  – Example: 'California' includes 'Los Angeles, California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, '(south|west|north|east|central)[a-z]*')
  – Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: 'animal' includes 'monkey'


expanding the two metrics already available in JustAsk (Levenshtein distanceand overlap distance)

We saw many different types of techniques over the last pages and manymore have been developed over the last ten years presenting variations of theideas which have been described such as building dependency trees or usingsemantical metrics dealing with the notions such as synonymy antonymy andhyponymy For the purpose of synthesis the work here represents what wefound with the most potential of being successfully tested on our system

There are two basic lines of work here the use of lexical-based featuressuch as the already implemented Levenshtein distance and the use of semanticinformation (from WordNet) From these there were several interesting featuresattached such as the transformation of units into canonical forms the potentialuse of context surrounding the answers we extract as candidates dependencytrees similarity matrixes and the introduction of a dissimilarity score in theformula of paraphrase identification

For all of these techniques we also saw different types of corpus for eval-uation The most prominent and most widely available is MSR ParaphraseCorpus which was criticized by Zhang and Patrick for being composed of toomany lexically similar sentences Other corpus used was Knight and Marcucorpus (KMC) composed of 1087 sentence pairs Since these two corpus lacknegative pairs ie pairs of non-paraphrases Cordeiro et al decided to createthree new corpus using a mixture of these two with negative pairs in order tobalance the number of positive pairs and negative ones Finally British NationalCorpus (BNC) was used for the semantic approach of Mihalcea et al

23

3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39] a QA system developedat INESC-ID5 that combines rule-based components with machine learning-based ones Over this chapter we will describe the architecture of this systemfocusing on the answers module and show the results already attained by thesystem in general and by the Answer Extraction module We then detail whatwe plan to do to improve this module

31 JustAsk

This section describes what has been done so far with JustAsk As saidbefore the systemrsquos architecture follows the standard QA architecture withthree stages Question Interpretation Passage Retrieval and Answer Extrac-tion The picture below shows the basic architecture of JustAsk We will workover the Answer Extraction module dealing with relationships between answersand paraphrase identification

Figure 1 JustAsk architecture mirroring the architecture of a general QAsystem

311 Question Interpretation

This module receives as input a natural language question and outputs aninterpreted question In this stage the idea is to gather important information

5httpwwwinesc-idpt

24

about a question (question analysis) and classify it under a category that shouldrepresent the type of answer we look for (question classification)

Question analysis includes tasks such as transforming the question into a setof tokens part-of-speech tagging and identification of a questions headwordie a word in the given question that represents the information we are lookingfor Question Classification is also a subtask in this module In JustAsk Ques-tion Classification uses Li and Rothrsquos [20] taxonomy with 6 main categories(Abbreviation Description Entity Human Location and Numeric)plus 50 more detailed classes The framework allows the usage of different clas-sification techniques hand-built rules and machine-learning JustAsk alreadyattains state of the art results in question classification [39]

312 Passage Retrieval

The passage retrieval module receives as input the interpreted question fromthe preceding module The output is a set of relevant passages ie extractsfrom a search corpus that contain key terms of the question and are thereforemost likely to contain a valid answer The module is composed by a group ofquery formulators and searchers The job here is to associate query formulatorsand question categories to search types Factoid-type questions such as lsquoWhenwas Mozart bornrsquo use Google or Yahoo search engines while definition questionslike lsquoWhat is a parrotrsquo use Wikipedia

As for query formulation JustAsk employs multiple strategies accordingto the question category The most simple is keyword-based strategy appliedfor factoid-type questions which consists in generating a query containing allthe words of a given question Another strategy is based on lexico-syntaticpatterns to learn how to rewrite a question in order to get us closer to theactual answer Wikipedia and DBpedia are used in a combined strategy usingWikipediarsquos search facilities to locate the title of the article where the answermight be found and DBpedia to retrieve the abstract (the first paragraph) ofthe article which is returned as the answer

313 Answer Extraction

The answer extraction module receives an interpreted question and a set ofrelevant passages as input and outputs the final answer There are two mainstages here Candidate Answer Extraction and Answer Selection The firstdeals with extracting the candidate answers and the latter with normalizationclustering and filtering the candidate answers to give us a correct final answer

JustAsk uses regular expressions (for Numeric type questions) named en-tities (able to recognize four entity types Person Location Organiza-tion and Miscellaneous) WordNet based recognizers (hyponymy relations)and Gazeteers (for the categories LocationCountry and LocationCity)to extract plausible answers from relevant passages Currently only answers oftype NumericCount and NumericDate are normalized Normalization isimportant to diminish the answers variation converting equivalent answers to

25

a canonical representation For example the dates lsquoMarch 21 1961rsquo and lsquo21March 1961rsquo will be transformed into lsquoD21 M3 Y1961rsquo

Similar instances are then grouped into a set of clusters and for that pur-pose distance metrics are used namely overlap distance and Levenshtein [19]distance The overlap distance is defined as

Overlap(XY ) = 1minus |X⋂Y |(min(|X| |Y |) (9)

where |X⋂Y | is the number of tokens shared by candidate answers X and Y and

min(|X| |Y |) the size of the smallest candidate answer being compared Thisformula returns 00 for candidate answers that are either equal or containedin the other As it was mentioned in Section 313 the Levenshtein distancecomputes the minimum number of operations in order to transform one stringto another These two distance metrics are used with a standard single-linkclustering algorithm in which the distance between two clusters is considered tobe the minimum of the distances between any member of the clusters Initiallyevery candidate answer has a cluster of its own and then for each step the twoclosest clusters are merged if they are under a certain threshold distance Thealgorithm halts when the distances of the remaining clusters are higher thanthe threshold Each cluster has a representative ndash the clusters longest answerand a score is assigned to the cluster which is simply the sum of the scoresof each candidate answer With the exception of candidate answers extractedusing regular expressions the score of a candidate answer is always 10 meaningthe clusters score will in many cases be directly influenced by its length

3.2 Experimental Setup

In this section, we describe the corpora used to evaluate our system, as well as the evaluation measures employed.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluating QA systems [43], are used:

Precision:

Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

F = (2 × precision × recall) / (precision + recall)    (12)

Accuracy:

Accuracy = Correctly judged items / Items in test corpus    (13)

Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by

MRR = (1/N) × Σ_{i=1}^{N} (1 / rank_i)    (14)
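The measure is straightforward to compute; the sketch below shows one possible implementation, with hypothetical names (mrr, rankedAnswers, goldAnswers) and a single gold answer per question as simplifying assumptions.

import java.util.Arrays;
import java.util.List;

// Minimal sketch of the MRR computation: for each question we take the rank of the
// first correct answer in the returned list (1-based) and average 1/rank.
public class MeanReciprocalRank {

    static double mrr(List<List<String>> rankedAnswers, List<String> goldAnswers) {
        double sum = 0.0;
        int n = rankedAnswers.size();
        for (int i = 0; i < n; i++) {
            int rank = rankedAnswers.get(i).indexOf(goldAnswers.get(i)); // -1 if never found
            if (rank >= 0)
                sum += 1.0 / (rank + 1);  // reciprocal rank; 0 contribution otherwise
        }
        return sum / n;
    }

    public static void main(String[] args) {
        List<List<String>> ranked = Arrays.asList(
                Arrays.asList("Lisbon", "Porto", "Faro"),   // correct at rank 1
                Arrays.asList("1994", "1998", "2002"),      // correct at rank 2
                Arrays.asList("blue", "red", "green"));     // never correct
        List<String> gold = Arrays.asList("Lisbon", "1998", "yellow");
        System.out.println(mrr(ranked, gold));  // (1 + 0.5 + 0) / 3 = 0.5
    }
}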

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and of JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions of these types were gathered from the Multieight corpus [25] and used as well.

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages retrieved from each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo queries produce is huge (a possible maximum of 64 × 200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1. Question=What is the capital of Somalia?

URL=http://en.wikipedia.org/wiki/Mogadishu; Title=Mogadishu - Wikipedia, the free encyclopedia; Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ...; Rank=1; Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate part of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.
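A minimal sketch of such a console loop is shown below. The file handling is simplified and all names (ReferenceAnnotator, annotate) are hypothetical; the only behaviour taken from the description above is that a typed reference is accepted only if it occurs in the snippet being shown.

import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

// Minimal sketch of the console loop used to build the 'answers corpus': for each
// question it shows the snippets one by one and accepts as a reference any string
// the annotator types that is contained in the snippet.
public class ReferenceAnnotator {

    public static void annotate(Map<String, List<String>> snippetsByQuestion,
                                PrintWriter output) {
        Scanner in = new Scanner(System.in);
        for (Map.Entry<String, List<String>> entry : snippetsByQuestion.entrySet()) {
            String question = entry.getKey();
            List<String> references = new ArrayList<>();
            for (String snippet : entry.getValue()) {
                System.out.println("QUESTION: " + question);
                System.out.println("CONTENT: " + snippet);
                System.out.print("reference (empty to skip)> ");
                String typed = in.nextLine().trim();
                // only accept the reference if it really occurs in the snippet
                if (!typed.isEmpty() && snippet.toLowerCase().contains(typed.toLowerCase()))
                    references.add(typed);
            }
            output.println(question + "\t" + String.join(";", references));
            output.flush();
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> data = Map.of(
                "What is the capital of Somalia?",
                List.of("Mogadishu is the largest city in Somalia and the nation's capital."));
        annotate(data, new PrintWriter(System.out));
    }
}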

Figure 3: Execution of the application to create answers corpora in the command line.

3.2.3 TREC corpus

In an effort to use a larger and more 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and, in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we filtered the ones for which we knew the New York Times had a correct answer; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
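A sketch of this filtering step could look as follows. The tab-separated formats of the question and judgement files are assumptions made for illustration; only the overall logic (keep a question if some New York Times document answers it correctly) comes from the description above.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the filtering step: keep only the TREC questions for which the
// judgement file records a correct answer coming from a New York Times document.
public class TrecQuestionFilter {

    public static void main(String[] args) throws IOException {
        List<String> questions = Files.readAllLines(Paths.get(args[0]));   // "<id>\t<question>"
        List<String> judgements = Files.readAllLines(Paths.get(args[1]));  // "<id>\t<docid>\t<correct?>"

        // collect the ids of questions answered correctly by an NYT document
        Set<String> answeredByNyt = new HashSet<>();
        for (String line : judgements) {
            String[] fields = line.split("\t");
            if (fields.length >= 3 && fields[1].startsWith("NYT") && fields[2].equals("1"))
                answeredByNyt.add(fields[0]);
        }

        for (String line : questions) {
            String id = line.split("\t")[0];
            if (answeredByNyt.contains(id))
                System.out.println(line);   // question survives the filter
        }
    }
}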

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even lose their original reference after this filtering step), and therefore lose precious entity information, for example by using pronouns, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' clearly remain unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask ourselves: 'How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?'

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.
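A minimal sketch of this pronoun-based filter, with hypothetical names, could be:

import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Minimal sketch of the pronoun filter: drop questions whose only reference to the
// topic is an unresolved pronoun such as 'he', 'she' or 'it'.
public class PronounFilter {

    private static final Set<String> PRONOUNS =
            Set.of("he", "she", "it", "her", "him", "hers", "his");

    static boolean hasUnresolvedPronoun(String question) {
        for (String token : question.toLowerCase().split("\\W+"))
            if (PRONOUNS.contains(token))
                return true;
        return false;
    }

    public static void main(String[] args) {
        List<String> questions = Arrays.asList(
                "When did she die?",
                "Who was his father?",
                "Where was the festival held?");
        List<String> kept = questions.stream()
                .filter(q -> !hasUnresolvedPronoun(q))
                .collect(Collectors.toList());
        System.out.println(kept);   // [Where was the festival held?]
    }
}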

But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?

In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus carried topic information that was often not explicit in the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and after noticing a few mistakes in the application that filters the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.

New York Times corpus

The 20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, and in as many variations, as in Google.

We also questioned whether the big problem lies in the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/sentence as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not whole documents but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see in the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

The results for the overall system before this work are described in Table 1. Precision@1 indicates the precision of the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of them correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.

Questions: 200    Correct: 88    Wrong: 87    No Answer: 25
Accuracy: 44.0%    Precision: 50.3%    Recall: 87.5%    MRR: 0.37    Precision@1: 35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions for which at least one correct answer is extracted, while the success of the AS stage is given by the number of questions for which one correct final answer is selected from the list of candidate answers. The tables also present the distribution of the number of questions by the amount of extracted answers.

Extracted answers | Google – 8 passages: Questions / CAE success / AS success | Google – 32 passages: Questions / CAE success / AS success
0          | 21 / 0 / –     | 19 / 0 / –
1 to 10    | 73 / 62 / 53   | 16 / 11 / 11
11 to 20   | 32 / 29 / 22   | 13 / 9 / 7
21 to 40   | 11 / 10 / 8    | 61 / 57 / 39
41 to 60   | 1 / 1 / 1      | 30 / 25 / 17
61 to 100  | 0 / – / –      | 18 / 14 / 11
> 100      | 0 / – / –      | 5 / 4 / 3
All        | 138 / 102 / 84 | 162 / 120 / 88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the CAE and AS stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any clear tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no matter how many passages are used.

Extracted answers | Google – 64 passages: Questions / CAE success / AS success
0          | 17 / 0 / 0
1 to 10    | 18 / 9 / 9
11 to 20   | 2 / 2 / 2
21 to 40   | 16 / 12 / 8
41 to 60   | 42 / 38 / 25
61 to 100  | 38 / 32 / 18
> 100      | 35 / 30 / 17
All        | 168 / 123 / 79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

Category | Total | Google – 8 passages: Q / CAE / AS | Google – 32 passages: Q / CAE / AS | Google – 64 passages: Q / CAE / AS
abb | 4   | 3 / 1 / 1      | 3 / 1 / 1      | 4 / 1 / 1
des | 9   | 6 / 4 / 4      | 6 / 4 / 4      | 6 / 4 / 4
ent | 21  | 15 / 1 / 1     | 16 / 1 / 0     | 17 / 1 / 0
hum | 64  | 46 / 42 / 34   | 55 / 48 / 36   | 56 / 48 / 31
loc | 39  | 27 / 24 / 19   | 34 / 27 / 22   | 35 / 29 / 21
num | 63  | 41 / 30 / 25   | 48 / 35 / 25   | 50 / 40 / 22
All | 200 | 138 / 102 / 84 | 162 / 120 / 88 | 168 / 123 / 79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also shows the discrepancies between the different stages for all of these categories, suggesting that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.

The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

          | 1134 questions | 232 questions | 1210 questions
Precision | 26.7%          | 28.2%         | 27.4%
Recall    | 77.2%          | 67.2%         | 76.9%
Accuracy  | 20.6%          | 19.0%         | 21.1%

Table 5: Baseline results for the three sets of questions created using the TREC questions and the New York Times corpus.

Later, we tried to run further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages  | 20     | 30     | 50     | 100
Accuracy  | 21.1%  | 22.4%  | 23.7%  | 23.8%

Table 6: Accuracy results, after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted, '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles ...'

– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd "Popa Falls – Okavango River – Botswana" webpage ...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might write a small set of hand-crafted rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Would there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words for each answer: LogSim is approximately equal to 0.014, and Levenshtein actually gives us a better score (0.59). With the new proposed metric we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be when we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.

3. Places/dates/numbers that differ in specificity – some candidates in the Location and Numeric categories that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: 'Where do we draw the line for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?'

Solution: based on [26], we decided to take relations of inclusion into account, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to also cover acronyms.

In this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

In this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), through the implementation of the set of rules that compose our 'relations module', new normalizers and a new proximity measure.

4.1 Improving the Answer Extraction Module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

Once this process is automated and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the purpose of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk that detects relations between answers and then uses them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to, plus a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes the other, and one for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results did not change at all for any of the corpora.

Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
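A minimal sketch of this scoring scheme is shown below. The weights mirror the description above (1.0 for frequency and equivalence, 0.5 for inclusion), but the names and data structures are illustrative assumptions rather than JustAsk's actual Scorer interface.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of relation-based scoring: each candidate answer gets 1.0 per exact
// repetition or relation of equivalence and 0.5 per relation of inclusion.
public class RelationScorer {

    enum Relation { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    static class Pair {
        final int first, second;          // indexes of the two related candidates
        final Relation relation;
        Pair(int first, int second, Relation relation) {
            this.first = first; this.second = second; this.relation = relation;
        }
    }

    static Map<Integer, Double> score(List<String> candidates, List<Pair> relations) {
        Map<Integer, Double> scores = new HashMap<>();
        // baseline scorer: frequency of the exact same string
        for (int i = 0; i < candidates.size(); i++) {
            double freq = 0.0;
            for (String other : candidates)
                if (other.equals(candidates.get(i))) freq += 1.0;
            scores.put(i, freq);
        }
        // rules scorer: boost for equivalence (1.0) and inclusion (0.5)
        for (Pair p : relations) {
            double weight = (p.relation == Relation.EQUIVALENCE) ? 1.0 : 0.5;
            scores.merge(p.first, weight, Double::sum);
            scores.merge(p.second, weight, Double::sum);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("Ferrara", "Italy", "Ferrara");
        List<Pair> relations = Arrays.asList(new Pair(0, 1, Relation.INCLUDED_BY));
        System.out.println(score(candidates, relations));  // expected: {0=2.5, 1=1.5, 2=2.0}
    }
}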


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
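A minimal sketch of such a converter, restricted to distances and with hypothetical names, could be:

// Minimal sketch of the conversion normalizer: numeric answers are mapped to a single
// reference unit so that '1000 miles' and '1609 km' become comparable.
public class UnitNormalizer {

    // Converts a "<value> <unit>" answer to kilometres; only distance is sketched here.
    public static String normalizeDistance(String answer) {
        String[] parts = answer.trim().split("\\s+");
        if (parts.length != 2) return answer;
        double value;
        try {
            value = Double.parseDouble(parts[0]);
        } catch (NumberFormatException e) {
            return answer;                       // not a numeric answer, leave untouched
        }
        switch (parts[1].toLowerCase()) {
            case "km":
            case "kilometers":
            case "kilometres": return value + " km";
            case "miles":      return (value * 1.609344) + " km";
            case "feet":       return (value * 0.0003048) + " km";
            default:           return answer;    // unknown unit
        }
    }

    public static void main(String[] args) {
        System.out.println(normalizeDistance("1000 miles"));   // 1609.344 km
        System.out.println(normalizeDistance("300 km"));       // 300.0 km
    }
}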

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of the new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, so that 'Grozny' and 'GROZNY' are treated as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, and it only handles very particular cases for the moment, leaving room for further extensions.
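A minimal sketch of this kind of acronym normalizer, with an illustrative lookup table, could be:

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the acronym normalizer: a small lookup table expands a few
// acronyms for categories such as Location:Country and Human:Group.
// The table entries here are examples only.
public class AcronymNormalizer {

    private static final Map<String, String> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put("US", "United States");
        EXPANSIONS.put("U.S.", "United States");
        EXPANSIONS.put("UN", "United Nations");
        EXPANSIONS.put("U.N.", "United Nations");
    }

    public static String normalize(String answer) {
        return EXPANSIONS.getOrDefault(answer.trim(), answer);
    }

    public static void main(String[] args) {
        System.out.println(normalize("U.S."));     // United States
        System.out.println(normalize("Somalia"));  // Somalia (unchanged)
    }
}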

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
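A minimal sketch of this proximity boost is given below; the character-offset computation and all names are assumptions, with only the 20-character threshold and the 5-point boost taken from the description above.

// Minimal sketch of the proximity boost: if the average character distance between the
// candidate answer and the query keywords found in the passage is below a threshold
// (20 characters here), the candidate's score is raised by 5 points.
public class ProximityBooster {

    public static double boostedScore(String passage, String answer,
                                      String[] keywords, double score) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0) return score;
        double sum = 0.0;
        int found = 0;
        for (String keyword : keywords) {
            int keywordPos = passage.indexOf(keyword);
            if (keywordPos >= 0) {
                sum += Math.abs(answerPos - keywordPos);
                found++;
            }
        }
        if (found == 0) return score;
        double averageDistance = sum / found;
        return (averageDistance < 20) ? score + 5 : score;
    }

    public static void main(String[] args) {
        String passage = "The Okavango River rises in the Angolan highlands and flows over 1000 miles.";
        String[] keywords = { "Okavango", "River" };
        System.out.println(boostedScore(passage, "1000 miles", keywords, 1.0)); // 1.0 (far from keywords)
        System.out.println(boostedScore(passage, "rises in the Angolan highlands", keywords, 1.0)); // 6.0 (boosted)
    }
}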


4.3 Results

                 | 200 questions              | 1210 questions
Before features  | P 50.6%  R 87.0%  A 44.0%  | P 28.0%  R 79.5%  A 22.2%
After features   | P 51.1%  R 87.0%  A 44.5%  | P 28.2%  R 79.8%  A 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that were put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it simply advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.

In the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of characters in common by the sum of the total number of characters of the two sequences.

– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams for each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets has five elements: Ke, en, ne, ed, dy. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven elements.
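The bigram-based computation can be sketched as follows (hypothetical class name, sets of bigrams as above):

import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the bigram-based Dice coefficient used in the example above.
public class DiceCoefficient {

    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++)
            grams.add(s.substring(i, i + 2));
        return grams;
    }

    static double dice(String a, String b) {
        Set<String> x = bigrams(a);
        Set<String> y = bigrams(b);
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        return 2.0 * common.size() / (x.size() + y.size());
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));  // 2*5/(6+6) ≈ 0.83
    }
}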


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by

dw = dj + l · p · (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of Levenshtein's constant cost of 1):

D(i, j) = min{ D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }    (19)

The package contained other metrics of particular interest to test, and the following two measures were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at each sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.


Then the LogSim metric by Cordeiro and his team, described in the Related Work section, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.

5.1.2 Results

The results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold | 0.17                    | 0.33                    | 0.5
Levenshtein        | P 51.1  R 87.0  A 44.5  | P 43.7  R 87.5  A 38.5  | P 37.3  R 84.5  A 31.5
Overlap            | P 37.9  R 87.0  A 33.0  | P 33.0  R 87.0  A 33.0  | P 36.4  R 86.5  A 31.5
Jaccard index      | P 9.8   R 56.0  A 5.5   | P 9.9   R 55.5  A 5.5   | P 9.9   R 55.5  A 5.5
Dice coefficient   | P 9.8   R 56.0  A 5.5   | P 9.9   R 56.0  A 5.5   | P 9.9   R 55.5  A 5.5
Jaro               | P 11.0  R 63.5  A 7.0   | P 11.9  R 63.0  A 7.5   | P 9.8   R 56.0  A 5.5
Jaro-Winkler       | P 11.0  R 63.5  A 7.0   | P 11.9  R 63.0  A 7.5   | P 9.8   R 56.0  A 5.5
Needleman-Wunsch   | P 43.7  R 87.0  A 38.0  | P 43.7  R 87.0  A 38.0  | P 16.9  R 59.0  A 10.0
LogSim             | P 43.7  R 87.0  A 38.0  | P 43.7  R 87.0  A 38.0  | P 42.0  R 87.0  A 36.5
Soundex            | P 35.9  R 83.5  A 30.0  | P 35.9  R 83.5  A 30.0  | P 28.7  R 82.0  A 23.5
Smith-Waterman     | P 17.3  R 63.5  A 11.0  | P 13.4  R 59.5  A 8.0   | P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (in %) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold in the original table.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it additionally gives extra weight when a pair of unmatched tokens shares the same first letter. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers, and Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens.

The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
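A minimal sketch of this metric, following equations (20)-(22) directly, could be written as below; the tokenization and the handling of duplicates are simplifying assumptions.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch of the distance defined in equations (20)-(22): common tokens count
// fully, and remaining tokens that only share the first letter count half.
public class JFKDistance {

    public static double distance(String a1, String a2) {
        List<String> tokens1 = new ArrayList<>(Arrays.asList(a1.split("\\s+")));
        List<String> tokens2 = new ArrayList<>(Arrays.asList(a2.split("\\s+")));
        int maxLength = Math.max(tokens1.size(), tokens2.size());

        // count and remove the perfectly matching tokens
        int commonTerms = 0;
        for (String token : new ArrayList<>(tokens1)) {
            if (tokens2.remove(token)) {          // token also present in the other answer
                tokens1.remove(token);
                commonTerms++;
            }
        }

        // among the non-common terms, count pairs sharing the same first letter
        int nonCommonTerms = tokens1.size() + tokens2.size();
        int firstLetterMatches = 0;
        for (String t1 : tokens1)
            for (String t2 : tokens2)
                if (!t1.isEmpty() && !t2.isEmpty()
                        && Character.toUpperCase(t1.charAt(0)) == Character.toUpperCase(t2.charAt(0)))
                    firstLetterMatches++;

        double commonTermsScore = (double) commonTerms / maxLength;
        double firstLetterMatchesScore = nonCommonTerms == 0
                ? 0.0
                : ((double) firstLetterMatches / nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    public static void main(String[] args) {
        // 1 - 1/3 - (2/4)/2 ≈ 0.42, low enough to cluster the two answers
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}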

A few flaws of this measure were identified later on, namely regarding its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                  | 200 questions              | 1210 questions
Before            | P 51.1%  R 87.0%  A 44.5%  | P 28.2%  R 79.8%  A 22.5%
After new metric  | P 51.7%  R 87.0%  A 45.0%  | P 29.2%  R 79.8%  A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible New Features – Combination of Distance Metrics to Enhance Overall Results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but that for the category Location LogSim is the metric which offers the best results, we would have something like:

if (category == Human) use Levenshtein;
if (category == Location) use LogSim;
(...)

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there are a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis we are looking for too much), Table 10 shows the results for every metric incorporated, by the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).

Metric / Category  | Human  | Location | Numeric
Levenshtein        | 50.00  | 53.85    | 4.84
Overlap            | 42.42  | 41.03    | 4.84
Jaccard index      | 7.58   | 0.00     | 0.00
Dice               | 7.58   | 0.00     | 0.00
Jaro               | 9.09   | 0.00     | 0.00
Jaro-Winkler       | 9.09   | 0.00     | 0.00
Needleman-Wunsch   | 42.42  | 46.15    | 4.84
LogSim             | 42.42  | 46.15    | 4.84
Soundex            | 42.42  | 46.15    | 3.23
Smith-Waterman     | 9.09   | 2.56     | 0.00
JFKDistance        | 54.55  | 53.85    | 4.84

Table 10: Accuracy results (in %) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold in the original table, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since we find out that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior Analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single, permanent answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are easily failable, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass by: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to return 'Dillinger', but chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error detected is in the reference file, which lists all the possible 'correct' answers but is potentially missing viable references.

After filling in gaps such as this, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state-of-the-art survey on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try next, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006

[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989

53

[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006

[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini Simone Romagnoli Alessandro Vallin Jesus HerreraAnselmo Penas Vıctor Peinado Felisa Verdejo and Maarten de Rijke Themultiple language question answering track at clef 2003 In CLEF volume3237 of Lecture Notes in Computer Science pages 471ndash486 Springer 2003

[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor ErbachAnselmo Pe nas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov andRichard F E Sutcliffe Overview of the clef 2004 multilingual questionanswering track In CLEF volume 3491 of Lecture Notes in ComputerScience pages 371ndash391 Springer 2004

54

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur, and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.


[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• is_hypernym(X, Y) – returns true if X is the hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• contains_ic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equals_ic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 1)
  – Example: "M01 Y2001" includes "D01 M01 Y2001"

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 2)
  – Example: "Y2001" includes "D01 M01 Y2001"


1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: "200 people" is equivalent to "200"

• matches_ic(Ai, "(.*)\bnumber(.*)") and matches_ic(Aj, "(.*)\bnumber(.*)") and (numAi,max > numAj) and (numAi,min < numAj)
  – Example: "210 people" is equivalent to "200 people"

Ai includes Aj if:

• (length(Aj) = 1) and matches_ic(Aj, number) and matches_ic(Ai, "(.*)\bapprox number(.*)")
  – Example: "approximately 200 people" includes "210"

• (length(Aj) = 1) and matches_ic(Aj, number) and matches_ic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  – Example: "approximately 200 people" includes "200"

• (length(Aj) = 1) and matches_ic(Aj, number) and matches_ic(Ai, "(.*)\bapprox number(.*)") and (numAj,max > numAi) and (numAj,min < numAi)
  – Example: "approximately 200 people" includes "210"


• (length(Aj) = 1) and matches_ic(Aj, number) and matches_ic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  – Example: "more than 200" includes "300"

• (length(Aj) = 1) and matches_ic(Aj, number) and matches_ic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  – Example: "less than 300" includes "200"

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equals_ic(Ai,0, Aj,0) and equals_ic(Ai,2, Aj,1)
  – Example: "John F. Kennedy" is equivalent to "John Kennedy"

• (length(Ai) = 2) and (length(Aj) = 3) and equals_ic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: "Mr. Kennedy" is equivalent to "John Kennedy"

• (length(Ai) = 3) and (length(Aj) = 3) and equals_ic(Ai,0, Aj,0) and equals_ic(Ai,2, Aj,2) and equals_ic(Ai,1,0, Aj,1,0)
  – Example: "John Fitzgerald Kennedy" is equivalent to "John F. Kennedy"

• (length(Ai) = 3) and (length(Aj) = 2) and equals_ic(Ai,1, Aj,0) and equals_ic(Ai,2, Aj,1)
  – Example: "John F. Kennedy" is equivalent to "F. Kennedy"

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equals_ic(Ai,1, Aj,0)
  – Example: "Mr. Kennedy" is equivalent to "Kennedy"

• contains_ic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equals_ic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: "Republic of China" is equivalent to "China"


• (length(Ai) = 3) and (length(Aj) = 1) and equals_ic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equals_ic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0")
  – Example: "in China" is equivalent to "China"

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: "in the lovely Republic of China" is equivalent to "China"

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equals_ic(Ai,0, Aj,2) and equals_ic(Aj,1, ",")
  – Example: "Florida" includes "Miami, Florida"

• (length(Ai) = 1) and (length(Aj) = 4) and equals_ic(Ai,0, Aj,3) and equals_ic(Aj,2, ",")
  – Example: "California" includes "Los Angeles, California"

• (length(Ai) = 1) and (length(Aj) = 2) and equals_ic(Ai,0, Aj,1) and matches_ic(Aj,0, "(south|west|north|east|central)[a-z]*")
  – Example: "California" includes "East California"

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equals_ic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and is_hypernym(Ai, Aj)
  – Example: "animal" includes "monkey"


3 JustAsk architecture and work points

The work of this thesis will be done on JustAsk [39], a QA system developed at INESC-ID5 that combines rule-based components with machine learning-based ones. Over this chapter we will describe the architecture of this system, focusing on the answers module, and show the results already attained by the system in general and by the Answer Extraction module. We then detail what we plan to do to improve this module.

3.1 JustAsk

This section describes what has been done so far with JustAsk. As said before, the system's architecture follows the standard QA architecture with three stages: Question Interpretation, Passage Retrieval and Answer Extraction. The picture below shows the basic architecture of JustAsk. We will work over the Answer Extraction module, dealing with relationships between answers and paraphrase identification.

Figure 1: JustAsk architecture, mirroring the architecture of a general QA system.

3.1.1 Question Interpretation

This module receives as input a natural language question and outputs an interpreted question. In this stage the idea is to gather important information about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask in this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state of the art results in question classification [39].

5 http://www.inesc-id.pt

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question and are therefore most likely to contain a valid answer. The module is composed by a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies, according to the question category. The most simple is the keyword-based strategy, applied for factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get us closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalization, clustering and filtering of the candidate answers, to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet based recognizers (hyponymy relations) and Gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from relevant passages. Currently only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will be transformed into 'D21 M3 Y1961'.

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as

    Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) is the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in the other. As mentioned before, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any member of the clusters. Initially every candidate answer has a cluster of its own and then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
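To make this step concrete, the following is a minimal sketch, in Java, of the overlap distance and of single-link clustering over tokenized candidate answers; class and method names are illustrative, not the actual JustAsk code:

import java.util.*;

public class SingleLinkClusterer {

    // Overlap distance between two tokenized answers (formula 9).
    static double overlap(List<String> x, List<String> y) {
        Set<String> shared = new HashSet<>(x);
        shared.retainAll(new HashSet<>(y));
        return 1.0 - (double) shared.size() / Math.min(x.size(), y.size());
    }

    // Single-link agglomerative clustering: start with singleton clusters and keep
    // merging the two closest clusters while their distance is under the threshold.
    static List<List<List<String>>> cluster(List<List<String>> answers, double threshold) {
        List<List<List<String>>> clusters = new ArrayList<>();
        for (List<String> a : answers) clusters.add(new ArrayList<>(Collections.singletonList(a)));
        while (true) {
            int bestI = -1, bestJ = -1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++)
                    for (List<String> a : clusters.get(i))
                        for (List<String> b : clusters.get(j)) {
                            double d = overlap(a, b);   // single-link: minimum pairwise distance
                            if (d < best) { best = d; bestI = i; bestJ = j; }
                        }
            if (bestI < 0 || best >= threshold) break;  // halt when no pair is under the threshold
            clusters.get(bestI).addAll(clusters.remove(bestJ));
        }
        return clusters;
    }
}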

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

    F = 2 × Precision × Recall / (Precision + Recall)    (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus    (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by

    MRR = (1/N) × Σ_{i=1..N} 1/rank_i    (14)
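As an illustration only (names are assumptions, not JustAsk code), these measures can be computed from raw counts as follows:

public class EvalMeasures {
    // correct: correctly judged items; answered: relevant (non-null) items retrieved;
    // total: items in the test corpus; ranks: rank of the first correct answer per question (0 if none).
    static double precision(int correct, int answered) { return (double) correct / answered; }
    static double recall(int answered, int total)      { return (double) answered / total; }
    static double f1(double p, double r)               { return 2 * p * r / (p + r); }
    static double accuracy(int correct, int total)     { return (double) correct / total; }

    static double mrr(int[] ranks) {
        double sum = 0;
        for (int rank : ranks)
            if (rank > 0) sum += 1.0 / rank;   // reciprocal rank; unanswered questions add 0
        return sum / ranks.length;
    }
}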

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input, and then we see if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references to the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of the references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially being updated manually from the first 64 passages that we found over at each of the two main search engines (Google and Yahoo), which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64 x 200 entries), and to review it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:

1. Question=What is the capital of Somalia?

URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing the corpus with a stipulated maximum of 64 snippets per question. The script runs the list of questions one by one, and for each one of them presents all the snippets, one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one finding and writing down correct answers, instead of the completely manual approach we used before writing the application. In this actual version, the script prints the status of the reference list to an output file as it is being filled, question by question, until the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.
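A minimal sketch of the annotation loop behind Figures 2 and 3 could look like the following; the parsing helpers and file formats are assumptions made for illustration, not the actual implementation:

import java.io.*;
import java.util.*;

public class ReferenceAnnotator {
    public static void main(String[] args) throws IOException {
        // args[0]: questions + references file; args[1]: corpus file with up to 64 snippets per question
        Map<Integer, List<String>> references = new HashMap<>();   // question id -> accepted references
        BufferedReader corpus = new BufferedReader(new FileReader(args[1]));
        Scanner in = new Scanner(System.in);
        String line;
        while ((line = corpus.readLine()) != null) {
            int questionId = parseQuestionId(line);      // hypothetical helpers that drop the
            String content = parseContentField(line);    // 'URL', 'Title', 'Rank' and 'Structured' fields
            System.out.println(content);
            String answer = in.nextLine().trim();        // the reference picked from this snippet
            if (!answer.isEmpty() && content.contains(answer))
                references.computeIfAbsent(questionId, k -> new ArrayList<>()).add(answer);
        }
        corpus.close();
        // ... finally, print the reference list, question by question, to an output file
    }

    static int parseQuestionId(String line) { return Integer.parseInt(line.split("Question=")[0].trim()); }
    static String parseContentField(String line) { return line.split("Content=")[1]; }
}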


3.2.3 TREC corpus

In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided a set of 1768 questions, from which we filtered to extract the questions we knew the New York Times had a correct answer to; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed by a set of questions and also by a set of correct/incorrect judgements from different sources according to each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly and what is the 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to filter only the questions for which we know the New York Times managed to retrieve a correct answer.
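A sketch of that filtering step, under an assumed tab-separated judgement format (source, question id, judgement, answer), might look like this:

import java.io.*;
import java.util.*;

public class TrecFilter {
    public static void main(String[] args) throws IOException {
        // args[0]: file with the 1768 questions; args[1]: file with the answer judgements
        Set<Integer> answeredByNyt = new HashSet<>();
        BufferedReader judgements = new BufferedReader(new FileReader(args[1]));
        String line;
        while ((line = judgements.readLine()) != null) {
            String[] fields = line.split("\t");          // assumed format: source, question id, judgement, answer
            if (fields[0].startsWith("NYT") && fields[2].equals("correct"))
                answeredByNyt.add(Integer.parseInt(fields[1]));
        }
        judgements.close();

        BufferedReader questions = new BufferedReader(new FileReader(args[0]));
        int id = 0;
        while ((line = questions.readLine()) != null) {
            id++;
            if (answeredByNyt.contains(id))              // keep only questions the NYT answers correctly
                System.out.println(line);
        }
        questions.close();
    }
}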

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference. We simply do not know, for now, who 'he' or 'she' is. It makes us question: "How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?"

The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and therefore was composed by a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution should be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and by noticing a few mistakes in the application to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced it to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started to return the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, in several variations, as in Google.

We also questioned whether the big problem is within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
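As a sketch of this paragraph-level indexing, assuming a recent Lucene version and illustrative field and path names (not necessarily the ones used in JustAsk):

import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ParagraphIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("nyt-index")), config)) {
            for (String article : loadArticles(args[0])) {            // hypothetical loader over the NYT collection
                for (String paragraph : article.split("\n\\s*\n")) {  // one Lucene document per paragraph
                    Document doc = new Document();
                    doc.add(new TextField("contents", paragraph, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
        }
    }

    static List<String> loadArticles(String path) { return Collections.emptyList(); } // stub for the sketch
}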

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, over 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions with a certain amount of extracted answers.

Extracted     Google – 8 passages            Google – 32 passages
Answers       Questions  CAE      AS         Questions  CAE      AS
              (#)        success  success    (#)        success  success
0             21         0        –          19         0        –
1 to 10       73         62       53         16         11       11
11 to 20      32         29       22         13         9        7
21 to 40      11         10       8          61         57       39
41 to 60      1          1        1          30         25       17
61 to 100     0          –        –          18         14       11
> 100         0          –        –          5          4        3
All           138        102      84         162        120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no matter how many passages are used.


              Google – 64 passages
Extracted     Questions  CAE      AS
Answers       (#)        success  success
0             17         0        0
1 to 10       18         9        9
11 to 20      2          2        2
21 to 40      16         12       8
41 to 60      42         38       25
61 to 100     38         32       18
> 100         35         30       17
All           168        123      79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

        Google – 8 passages     Google – 32 passages    Google – 64 passages
     Total    Q   CAE   AS         Q   CAE   AS            Q   CAE   AS
abb     4     3    1     1         3    1     1            4    1     1
des     9     6    4     4         6    4     4            6    4     4
ent    21    15    1     1        16    1     0           17    1     0
hum    64    46   42    34        55   48    36           56   48    31
loc    39    27   24    19        34   27    22           35   29    21
num    63    41   30    25        48   35    25           50   40    22
      200   138  102    84       162  120    88          168  123    79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to the Answer Selection with the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpus that were created using TREC questions and the New York Times corpus.

Later we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

            20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though the 100 passages offer the best results, it also takes more than thrice the time of running just 20, with a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still presents over its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals their correctness:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'


– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls - Okavango River - Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Utiel-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity on a token level, we might be able to cluster these answers. If there is at least one match between two tokens from candidate answers, then it is quite likely that the two mention the same entity, specially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern, Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: Gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such as Soundex can solve the more specific cases, such as the one described above. The question remains: when and where do we apply it, exactly?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to include this type of acronym.

Over this section we have looked at several points of improvement. Next, we will describe how we solved each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers of a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and specially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the metrics already existent to more efficient, semantically-based ones, using notions not only of synonymy but also hyponymy and antonymy that we can access through WordNet. Moreover, we may want to


analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have made this process automatic and we have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As it was described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share a same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
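A minimal sketch of how such a scorer could be wired is shown below; all names are illustrative and the actual JustAsk interfaces may differ:

import java.util.ArrayList;
import java.util.List;

enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY, ALTERNATIVE, CONTRADICTION }
class Relation { RelationType type; Answer other; double score; }
class Answer { double score; int frequency; List<Relation> relations = new ArrayList<>(); }

interface Scorer {
    void score(List<Answer> candidates);    // boosts the score of each candidate in place
}

class BaselineScorer implements Scorer {
    public void score(List<Answer> candidates) {
        for (Answer a : candidates)
            a.score += 1.0 * a.frequency;   // frequency: exact repetitions of the same answer
    }
}

class RulesScorer implements Scorer {
    public void score(List<Answer> candidates) {
        for (Answer a : candidates)
            for (Relation r : a.relations) {
                if (r.type == RelationType.EQUIVALENCE) a.score += 1.0;
                if (r.type == RelationType.INCLUDES || r.type == RelationType.INCLUDED_BY) a.score += 0.5;
            }
    }
}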


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Over here we find a particular issue worthy of noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals only to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also evaluates very particular cases for the moment, allowing room for further extensions.
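A sketch of the conversion normalizer, with rounded conversion factors and illustrative names (not the actual JustAsk class):

import java.util.function.DoubleFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnitNormalizer {
    private static final Pattern KM = Pattern.compile("([0-9]+(?:\\.[0-9]+)?)\\s*(?:km|kilometres?|kilometers?)");
    private static final Pattern KG = Pattern.compile("([0-9]+(?:\\.[0-9]+)?)\\s*(?:kg|kilograms?)");
    private static final Pattern CELSIUS = Pattern.compile("([0-9]+(?:\\.[0-9]+)?)\\s*(?:°C|degrees Celsius)");

    // Converts metric mentions to the units chosen as the canonical representation.
    public static String normalize(String answer) {
        answer = replace(KM, answer, v -> (v * 0.621371) + " miles");
        answer = replace(KG, answer, v -> (v * 2.20462) + " pounds");
        answer = replace(CELSIUS, answer, v -> (v * 9 / 5 + 32) + " degrees Fahrenheit");
        return answer;
    }

    private static String replace(Pattern p, String s, DoubleFunction<String> convert) {
        Matcher m = p.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find())
            m.appendReplacement(sb, Matcher.quoteReplacement(convert.apply(Double.parseDouble(m.group(1)))));
        m.appendTail(sb);
        return sb.toString();
    }
}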

4.2.2 Proximity Measure

As with the problem in 1., we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear through the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
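A sketch of this proximity boost, using the 20-character window and the 5-point boost described above (method names are illustrative):

public class ProximityScorer {
    static final int MAX_AVG_DISTANCE = 20;   // characters
    static final double BOOST = 5.0;

    // Boosts the answer score when the answer appears, on average, close to the query keywords in the passage.
    static double boost(String passage, String answer, String[] keywords, double score) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0) return score;
        int sum = 0, found = 0;
        for (String keyword : keywords) {
            int pos = passage.indexOf(keyword);
            if (pos >= 0) { sum += Math.abs(pos - answerPos); found++; }
        }
        if (found > 0 && (double) sum / found <= MAX_AVG_DISTANCE)
            return score + BOOST;
        return score;
    }
}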


4.3 Results

                   200 questions                     1210 questions
Before features    P: 50.6%  R: 87.0%  A: 44.0%      P: 28.0%  R: 79.5%  A: 22.2%
After features     P: 51.1%  R: 87.0%  A: 44.5%      P: 28.2%  R: 79.8%  A: 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described over 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references we gathered in the previous corpus and, for each pair of references, assigned one of the possible four types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in that case a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As it happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;


– 'D' for cases of doubt about the relation type.

If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.

Over the next chapter we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics to potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined by the size of the intersection divided by the size of the union of two sets (A and B):

    J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea will be to divide the number of characters in common by the sum of the total number of characters of the two sequences.

– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With X and Y as two sets of keywords, Dice's coefficient is given by the following formula:

    Dice(X, Y) = 2|X ∩ Y| / (|X| + |Y|)    (16)

Dice's coefficient will basically give double the weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams for each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (five shared bigrams out of seven distinct ones). A small sketch after this list of metrics illustrates the computation.


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

    Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

    dw = dj + (l × p × (1 − dj))    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 for Levenshtein):

    D(i, j) = min { D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }    (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the threshold proposed (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.
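As referred to above for Dice's coefficient, here is a small sketch of the bigram-based computation (illustrative code, not the SimMetrics implementation):

import java.util.HashSet;
import java.util.Set;

public class BigramDice {
    static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++)
            result.add(s.substring(i, i + 2));
        return result;
    }

    // Dice = 2 * |X ∩ Y| / (|X| + |Y|), computed over the sets of character bigrams.
    static double dice(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> shared = new HashSet<>(x);
        shared.retainAll(y);
        return 2.0 * shared.size() / (x.size() + y.size());
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));   // 2*5/(6+6) ≈ 0.83
    }
}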


Then the LogSim metric by Cordeiro and his team, described in the Related Work section, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much it affects the performance of each metric.

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between these two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually inexistent in most cases.


Metric \ Threshold    0.17                        0.33                        0.5
Levenshtein           P 51.1  R 87.0  A 44.5      P 43.7  R 87.5  A 38.5      P 37.3  R 84.5  A 31.5
Overlap               P 37.9  R 87.0  A 33.0      P 33.0  R 87.0  A 33.0      P 36.4  R 86.5  A 31.5
Jaccard index         P 9.8   R 56.0  A 5.5       P 9.9   R 55.5  A 5.5       P 9.9   R 55.5  A 5.5
Dice coefficient      P 9.8   R 56.0  A 5.5       P 9.9   R 56.0  A 5.5       P 9.9   R 55.5  A 5.5
Jaro                  P 11.0  R 63.5  A 7.0       P 11.9  R 63.0  A 7.5       P 9.8   R 56.0  A 5.5
Jaro-Winkler          P 11.0  R 63.5  A 7.0       P 11.9  R 63.0  A 7.5       P 9.8   R 56.0  A 5.5
Needleman-Wunsch      P 43.7  R 87.0  A 38.0      P 43.7  R 87.0  A 38.0      P 16.9  R 59.0  A 10.0
LogSim                P 43.7  R 87.0  A 38.0      P 43.7  R 87.0  A 38.0      P 42.0  R 87.0  A 36.5
Soundex               P 35.9  R 83.5  A 30.0      P 35.9  R 83.5  A 30.0      P 28.7  R 82.0  A 23.5
Smith-Waterman        P 17.3  R 63.5  A 11.0      P 13.4  R 59.5  A 8.0       P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded in the original.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J.F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match but also have another token that shares the first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would be a value identical to a match between two tokens for that extra weight.

The new distance is computed in the following way:

    distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by

    commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

    firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
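A minimal sketch of this computation (assuming whitespace tokenization over already-normalized answers; the function name is illustrative, not the actual JustAsk code):

    def jfk_distance(a1, a2):
        # Tokenize both candidate answers (assumes prior normalization).
        t1, t2 = a1.split(), a2.split()
        common = set(w.lower() for w in t1) & set(w.lower() for w in t2)
        # Tokens left unmatched in each answer.
        rest1 = [w for w in t1 if w.lower() not in common]
        rest2 = [w for w in t2 if w.lower() not in common]
        common_terms_score = len(common) / max(len(t1), len(t2))
        # Unmatched tokens whose first letter matches a token of the other answer.
        first_letter_matches = sum(
            1 for w in rest1 if any(w[0].lower() == v[0].lower() for v in rest2))
        non_common_terms = len(rest1) + len(rest2)
        if non_common_terms > 0:
            # Half the weight of a perfect token match (equation 22).
            first_letter_score = (first_letter_matches / non_common_terms) / 2
        else:
            first_letter_score = 0.0
        return 1 - common_terms_score - first_letter_score

    # 'J. F. Kennedy' vs 'John Fitzgerald Kennedy' -> roughly 0.42,
    # already below the ~0.5 given by plain Levenshtein for this pair.
    print(jfk_distance("J. F. Kennedy", "John Fitzgerald Kennedy"))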

A few flaws of this measure were noticed later on, namely the scope of this measure within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures according to the type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.

48

5.2.2 Results

Table 9 shows us the results after the implementation of this new distance metric.

                    200 questions              1210 questions
Before              P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5
After new metric    P 51.7  R 87.0  A 45.0     P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a combination of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect much the analysis we are looking for), the results for every metric incorporated, for the three main categories that can be easily evaluated – Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input) – are shown in Table 10.

Metric / Category     Human    Location   Numeric
Levenshtein           50.00    53.85      4.84
Overlap               42.42    41.03      4.84
Jaccard index          7.58     0.00      0.00
Dice                   7.58     0.00      0.00
Jaro                   9.09     0.00      0.00
Jaro-Winkler           9.09     0.00      0.00
Needleman-Wunch       42.42    46.15      4.84
LogSim                42.42    46.15      4.84
Soundex               42.42    46.15      3.23
Smith-Waterman         9.09     2.56      0.00
JFKDistance           54.55    53.85      4.84

Table 10: Accuracy results (in %) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best results in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how they perform overall. That stops us from picking the best metric for each category, since we find out that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – Some questions do not have a single fixed answer, i.e., one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).

– Numeric answers are easily failable, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for a) ignoring 'one' to 'three' when written in words, b) excluding the publishing date from the snippet somehow, and c) improving the regular expressions used to filter candidates (a small sketch of such a filter is given after this list).

– Filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And, in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
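As referenced in the list above, a possible sketch of the numeric filtering suggested in points a) and b) (a hypothetical helper, not the actual JustAsk implementation):

    LOW_NUMBER_WORDS = {"one", "two", "three"}

    def filter_numeric_candidates(candidates, publishing_date=None):
        # Drop weak numeric candidates: spelled-out low numbers (point a) and
        # values that merely echo the passage's publishing date (point b).
        kept = []
        for cand in candidates:
            text = cand.strip().lower()
            if text in LOW_NUMBER_WORDS:
                continue
            if publishing_date and text in publishing_date.lower():
                continue
            kept.append(cand)
        return kept

    # Example: 'one' and '2010' are dropped when the snippet was published in 2010.
    print(filter_numeric_candidates(["one", "1000 miles", "2010"],
                                    publishing_date="April 3, 2010"))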


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Languages, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the j-th token of candidate answer Ai

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is a hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 1)
  – Example: 'M01 Y2001' includes 'D01 M01 Y2001'

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 2)
  – Example: 'Y2001' includes 'D01 M01 Y2001'



1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: '200 people' is equivalent to '200'

• matchesic(Ai, "()\bnumber( )") and matchesic(Aj, "()\bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
  – Example: '210 people' is equivalent to '200 people'

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )")
  – Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAi = numAj)
  – Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: 'approximately 200 people' includes '210'


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bmorethan number( )") and (numAi < numAj)
  – Example: 'more than 200' includes '300'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\blessthan number( )") and (numAi > numAj)
  – Example: 'less than 300' includes '200'
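As a small illustration (not the actual JustAsk code) of how two of the inclusion rules above can be checked in practice, assuming numX is taken as the first number found in the answer:

    import re

    NUMBER = re.compile(r"\d+(?:\.\d+)?")

    def num(answer):
        # numX: the numeric value in an answer (first number found, if any).
        m = NUMBER.search(answer)
        return float(m.group()) if m else None

    def includes_numeric(ai, aj, max_f=1.1, min_f=0.9):
        # 'more than X' includes Y if Y > X; 'approx X' includes Y if
        # min_f * X < Y < max_f * X (the max/min values defined above).
        ni, nj = num(ai), num(aj)
        if ni is None or nj is None:
            return False
        low = ai.lower()
        if any(w in low for w in ("more than", "over", "at least")):
            return nj > ni
        if any(w in low for w in ("approx", "around", "roughly", "about", "some")):
            return min_f * ni < nj < max_f * ni
        return False

    print(includes_numeric("more than 200", "300"))              # True
    print(includes_numeric("approximately 200 people", "210"))   # True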

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: 'Mr. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: 'John Fitzgerald Kennedy' is equivalent to 'John F. Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F. Kennedy' is equivalent to 'F. Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: 'Mr. Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: 'Republic of China' is equivalent to 'China'

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0")
  – Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: 'Florida' includes 'Miami, Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: 'California' includes 'Los Angeles, California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  – Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: 'animal' includes 'monkey'


about a question (question analysis) and classify it under a category that should represent the type of answer we look for (question classification).

Question analysis includes tasks such as transforming the question into a set of tokens, part-of-speech tagging, and identification of the question's headword, i.e., a word in the given question that represents the information we are looking for. Question Classification is also a subtask in this module. In JustAsk, Question Classification uses Li and Roth's [20] taxonomy, with 6 main categories (Abbreviation, Description, Entity, Human, Location and Numeric) plus 50 more detailed classes. The framework allows the usage of different classification techniques: hand-built rules and machine learning. JustAsk already attains state of the art results in question classification [39].

3.1.2 Passage Retrieval

The passage retrieval module receives as input the interpreted question from the preceding module. The output is a set of relevant passages, i.e., extracts from a search corpus that contain key terms of the question, and are therefore most likely to contain a valid answer. The module is composed of a group of query formulators and searchers. The job here is to associate query formulators and question categories to search types. Factoid-type questions such as 'When was Mozart born?' use the Google or Yahoo search engines, while definition questions like 'What is a parrot?' use Wikipedia.

As for query formulation, JustAsk employs multiple strategies according to the question category. The simplest is the keyword-based strategy, applied to factoid-type questions, which consists in generating a query containing all the words of a given question. Another strategy is based on lexico-syntactic patterns, to learn how to rewrite a question in order to get closer to the actual answer. Wikipedia and DBpedia are used in a combined strategy, using Wikipedia's search facilities to locate the title of the article where the answer might be found, and DBpedia to retrieve the abstract (the first paragraph) of the article, which is returned as the answer.

3.1.3 Answer Extraction

The answer extraction module receives an interpreted question and a set of relevant passages as input, and outputs the final answer. There are two main stages here: Candidate Answer Extraction and Answer Selection. The first deals with extracting the candidate answers, and the latter with normalizing, clustering and filtering the candidate answers to give us a correct final answer.

JustAsk uses regular expressions (for Numeric type questions), named entities (being able to recognize four entity types: Person, Location, Organization and Miscellaneous), WordNet-based recognizers (hyponymy relations) and gazetteers (for the categories Location:Country and Location:City) to extract plausible answers from the relevant passages. Currently, only answers of type Numeric:Count and Numeric:Date are normalized. Normalization is important to diminish the answers' variation, converting equivalent answers to a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will be transformed into 'D21 M3 Y1961'.

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as

    Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in the other. As mentioned in Section 3.1.3, the Levenshtein distance computes the minimum number of edit operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any two members of the clusters. Initially, every candidate answer has a cluster of its own; then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.
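As an illustration, a compact sketch of the overlap distance and of the single-link clustering loop it feeds (whitespace tokenization and the function names are simplifying assumptions, not the actual implementation):

    def overlap_distance(x, y):
        xt, yt = set(x.lower().split()), set(y.lower().split())
        # 0.0 when one answer is equal to, or contained in, the other.
        return 1.0 - len(xt & yt) / min(len(xt), len(yt))

    def single_link_clusters(answers, distance, threshold):
        # Start with one cluster per candidate answer.
        clusters = [[a] for a in answers]
        while len(clusters) > 1:
            # Single link: the cluster distance is the minimum pairwise distance.
            i, j, d = min(
                ((i, j, min(distance(a, b) for a in clusters[i] for b in clusters[j]))
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                key=lambda t: t[2])
            if d > threshold:       # halt when the closest pair is too far apart
                break
            clusters[i].extend(clusters.pop(j))
        return clusters

    print(single_link_clusters(
        ["Mogadishu", "the city of Mogadishu", "Washington"],
        overlap_distance, 0.17))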

3.2 Experimental Setup

Over this section, we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

    F = 2 × (precision × recall) / (precision + recall)    (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus    (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

    MRR = (1/N) * Σ_{i=1..N} (1 / rank_i)    (14)
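As a small worked illustration (a hypothetical helper, not part of JustAsk), the MRR of a test collection can be computed from the rank of the first correct answer of each query:

    def mean_reciprocal_rank(ranks):
        # ranks: rank of the first correct answer per query; None if none is correct.
        return sum(1.0 / r for r in ranks if r) / len(ranks)

    # Three queries: correct answer at ranks 1 and 3, and never found.
    print(mean_reciprocal_rank([1, 3, None]))   # (1 + 1/3 + 0) / 3 = 0.44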

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input, and then we see if the answers the system outputs match the ones we found ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions, we have a set of references – possible correct answers – initially updated manually from the first 64 passages that we found at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhausting in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64×200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1 Question=What is the capital of Somalia

URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single reference (or multiple references). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing the corpus, with a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...', if we insert 'Mogadishu' the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, by the final snippet of each question. Figure 2 illustrates the process,


while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.
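A minimal sketch of the validation loop described above (file parsing omitted and names hypothetical): it shows each snippet's content and accepts any typed string that is contained in it as a new reference.

    def collect_references(questions, snippets_by_question):
        references = []                      # an 'array of arrays', one list per question
        for qid, question in enumerate(questions):
            refs = []
            for snippet in snippets_by_question.get(qid, []):
                print(question)
                print(snippet)               # only the 'Content' field is shown
                answer = input("reference (empty to skip): ").strip()
                if answer and answer in snippet:
                    refs.append(answer)      # accepted: it is part of the content
            references.append(refs)
        return references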


3.2.3 TREC corpus

In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided a set of 1768 questions, from which we filtered out the questions for which we knew the New York Times had a correct answer; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know if a source manages to answer a question correctly, and what is the 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to filter only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference. We simply do not know, for now, who 'he' or 'she' is. It makes us question: "How much can we improve something that has no clear references, because the question itself references the previous one, and so forth?"

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and therefore was composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution would be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and by noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier at JustAsk, we reduced it to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as in Google.

We also questioned whether the big problem is within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision1
44.0       50.3        87.5     0.37   35.4

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final correct answer from the list of candidate answers. The tables also present a distribution of the number of questions with a certain amount of extracted answers.

Extracted        Google – 8 passages               Google – 32 passages
Answers          Questions   CAE      AS           Questions   CAE      AS
                             success  success                  success  success
0                21          0        –            19          0        –
1 to 10          73          62       53           16          11       11
11 to 20         32          29       22           13          9        7
21 to 40         11          10       8            61          57       39
41 to 60         1           1        1            30          25       17
61 to 100        0           –        –            18          14       11
> 100            0           –        –            5           4        3
All              138         102      84           162         120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.



Extracted        Google – 64 passages
Answers          Questions   CAE success   AS success
0                17          0             0
1 to 10          18          9             9
11 to 20         2           2             2
21 to 40         16          12            8
41 to 60         42          38            25
61 to 100        38          32            18
> 100            35          30            17
All              168         123           79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

            Google – 8 passages    Google – 32 passages    Google – 64 passages
     Total  Q     CAE   AS         Q     CAE   AS           Q     CAE   AS
abb  4      3     1     1          3     1     1            4     1     1
des  9      6     4     4          6     4     4            6     4     4
ent  21     15    1     1          16    1     0            17    1     0
hum  64     46    42    34         55    48    36           56    48    31
loc  39     27    24    19         34    27    22           35    29    21
num  63     41    30    25         48    35    25           50    40    22
     200    138   102   84         162   120   88           168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

With the category Entity, the system only extracts one final answer, no matter how many passages are used. We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered correctly 234 questions (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

             1134 questions   232 questions   1210 questions
Precision    26.7             28.2            27.4
Recall       77.2             67.2            76.9
Accuracy     20.6             19.0            21.1

Table 5: Baseline results (in %) for the three question sets that were created using TREC questions and the New York Times corpus.

Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

             20     30     50     100
Accuracy     21.1   22.4   23.7   23.8

Table 6: Accuracy results (in %) after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though the 100 passages offer the best results, they also take more than thrice the time of running just 20, with a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section was designed to report some of the most common mistakes that JustAsk still presents in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals that correctness:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'


– 'Namibia has built a water canal measuring about 300 km long... Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt. Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case, we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should be the threshold here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be 'Yes'.


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt. Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case, we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw a limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above. The question remains: when and where do we apply it, exactly?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to include this type of abbreviation.

Over this section, we have looked at several points of improvement. Next, we will describe how we solved all of these points.


4 Relating Answers

Over this section, we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input, and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy, but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have made this process automatic and have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer, and therefore the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from there use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes and the other for the answer that is included by the other).

For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and give a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.
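To make this scoring scheme concrete, here is a minimal sketch of how a rules-based scorer could combine frequency, equivalence and inclusion boosts. The class, enum and method names are illustrative assumptions, not the actual JustAsk code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch of the relation-based score boosting (names are illustrative). */
public class RulesScorerSketch {

    enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    /** A relation a candidate answer shares with another candidate answer. */
    record Relation(String otherAnswer, RelationType type) {}

    /** Boosts each candidate's score: 1.0 per exact repetition (frequency),
     *  plus 1.0 per equivalence and 0.5 per inclusion found by the rules. */
    static Map<String, Double> score(Map<String, Integer> frequencies,
                                     Map<String, List<Relation>> relations) {
        Map<String, Double> scores = new HashMap<>();
        for (var entry : frequencies.entrySet()) {
            String answer = entry.getKey();
            double score = entry.getValue();              // baseline: frequency
            for (Relation r : relations.getOrDefault(answer, List.of())) {
                score += (r.type() == RelationType.EQUIVALENCE) ? 1.0 : 0.5;
            }
            scores.put(answer, score);
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = Map.of("J. F. Kennedy", 2, "John Fitzgerald Kennedy", 1);
        Map<String, List<Relation>> rel = Map.of(
            "J. F. Kennedy", List.of(new Relation("John Fitzgerald Kennedy", RelationType.EQUIVALENCE)),
            "John Fitzgerald Kennedy", List.of(new Relation("J. F. Kennedy", RelationType.EQUIVALENCE)));
        // 'J. F. Kennedy' gets 3.0 (2 occurrences + 1 equivalence), the other answer gets 2.0
        System.out.println(score(freq, rel));
    }
}
```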


Figure 5: Relations' module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
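As an illustration of this kind of normalizer, the sketch below converts a few metric quantities to their imperial counterparts. The conversion constants are standard, but the class name, the method and the exact set of units handled are assumptions for illustration, not the original implementation.

```java
/** Minimal sketch of a unit-conversion normalizer (names and unit coverage are illustrative). */
public class UnitNormalizerSketch {

    /** Converts answers such as "300 km" or "100 kg" to the corresponding imperial unit,
     *  so that equivalent Numeric answers expressed in different units can be clustered. */
    static String normalize(String answer) {
        String[] parts = answer.trim().split("\\s+");
        if (parts.length != 2) return answer;            // not a "<value> <unit>" answer
        double value;
        try { value = Double.parseDouble(parts[0]); } catch (NumberFormatException e) { return answer; }
        switch (parts[1].toLowerCase()) {
            case "km": case "kilometres": case "kilometers": return (value * 0.621371) + " miles";
            case "m":  case "meters":     case "metres":     return (value * 3.28084)  + " feet";
            case "kg": case "kilograms":                     return (value * 2.20462)  + " pounds";
            case "c":  case "celsius":                       return (value * 9 / 5 + 32) + " fahrenheit";
            default: return answer;                          // unknown unit: leave untouched
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("300 km"));   // roughly 186.4 miles
        System.out.println(normalize("100 kg"));   // roughly 220.5 pounds
    }
}
```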

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, allowing room for further extensions.

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
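A minimal sketch of this proximity boost could look as follows, using character positions within the passage and the 20-character / 5-point values mentioned above; the class and method names are illustrative assumptions.

```java
import java.util.List;

/** Sketch of the proximity boost: answers whose average character distance to the
 *  query keywords in the passage falls under a threshold get their score raised. */
public class ProximityBoostSketch {

    static final int THRESHOLD = 20;   // characters
    static final double BOOST  = 5.0;  // score points

    static double boost(String passage, String answer, List<String> keywords, double score) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0) return score;
        double sum = 0;
        int found = 0;
        for (String kw : keywords) {
            int kwPos = passage.indexOf(kw);
            if (kwPos >= 0) { sum += Math.abs(kwPos - answerPos); found++; }
        }
        if (found > 0 && sum / found < THRESHOLD) return score + BOOST;
        return score;
    }

    public static void main(String[] args) {
        String passage = "The Okavango River rises in the Angolan highlands and flows over 1000 miles.";
        // prints 1.0 here: on average the answer sits more than 20 characters away from the keywords
        System.out.println(boost(passage, "1000 miles", List.of("Okavango", "River"), 1.0));
    }
}
```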


4.3 Results

                     200 questions               1210 questions
Before features      P 50.6   R 87.0   A 44.0    P 28.0   R 79.5   A 22.2
After features       P 51.1   R 87.0   A 44.5    P 28.2   R 79.8   A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed, before dealing with the implementation of the module itself. For that effort, we took the references we gathered in the previous corpus and, for each pair of references, assigned one of the possible four types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in that case a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As it happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.

If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
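A compact sketch of such an annotation loop is shown below. The input format (one question per line, with its references separated by tabs), the output format and the class name are assumptions made for illustration; the real tool may parse its files differently.

```java
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

/** Minimal sketch of the relation-annotation tool described above. */
public class RelationAnnotatorSketch {
    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Path.of(args[0])); // "question<TAB>ref1<TAB>ref2..."
        Scanner in = new Scanner(System.in);
        Set<String> valid = Set.of("E", "I1", "I2", "A", "C", "D");
        try (PrintWriter out = new PrintWriter(args[1])) {
            for (int q = 0; q < lines.size(); q++) {
                String[] fields = lines.get(q).split("\t");
                // a question with a single reference is skipped automatically (inner loops never run)
                for (int i = 1; i < fields.length; i++) {
                    for (int j = i + 1; j < fields.length; j++) {
                        String judgment;
                        do {
                            System.out.printf("QUESTION %d: '%s' vs '%s' [E/I1/I2/A/C/D/skip]%n",
                                              q + 1, fields[i], fields[j]);
                            judgment = in.nextLine().trim();
                        } while (!valid.contains(judgment) && !judgment.equals("skip"));
                        if (!judgment.equals("skip"))
                            out.printf("QUESTION %d: %s - %s AND %s%n",
                                       q + 1, expand(judgment), fields[i], fields[j]);
                    }
                }
            }
        }
    }

    static String expand(String code) {
        return switch (code) {
            case "E" -> "EQUIVALENCE";
            case "I1", "I2" -> "INCLUSION";
            case "A" -> "COMPLEMENTARY ANSWERS";
            case "C" -> "CONTRADICTORY ANSWERS";
            default -> "IN DOUBT";
        };
    }
}
```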

Figure 6: Execution of the application to create relations' corpora in the command line.

Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics'.

– Jaccard Index [11] – the Jaccard index (also known as Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements the two sequences share by the total number of distinct elements across both sequences.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's Coefficient basically gives double the weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven distinct bigrams (a short snippet reproducing this computation is shown right after this list of metrics).


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = (1/3) × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l·p·(1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 used by Levenshtein):

D(i, j) = min { D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }    (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.
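Following up on the Dice/Jaccard example given earlier in this list ('Kennedy' vs 'Keneddy'), the snippet below reproduces the bigram computation; it is only a worked example, not part of the system.

```java
import java.util.HashSet;
import java.util.Set;

/** Reproduces the 'Kennedy' vs 'Keneddy' bigram example for Dice and Jaccard. */
public class BigramSimilaritySketch {

    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    public static void main(String[] args) {
        Set<String> a = bigrams("Kennedy"), b = bigrams("Keneddy");
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        double dice = 2.0 * inter.size() / (a.size() + b.size()); // (2*5)/(6+6) ≈ 0.83
        double jaccard = (double) inter.size() / union.size();    // 5/7 ≈ 0.71
        System.out.println("Dice = " + dice + ", Jaccard = " + jaccard);
    }
}
```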


Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 2.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much it affects the performance of each metric.

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually inexistent in most cases.


Metric / Threshold    0.17                   0.33                   0.5
Levenshtein           P 51.1 R 87.0 A 44.5   P 43.7 R 87.5 A 38.5   P 37.3 R 84.5 A 31.5
Overlap               P 37.9 R 87.0 A 33.0   P 33.0 R 87.0 A 33.0   P 36.4 R 86.5 A 31.5
Jaccard index         P 9.8  R 56.0 A 5.5    P 9.9  R 55.5 A 5.5    P 9.9  R 55.5 A 5.5
Dice Coefficient      P 9.8  R 56.0 A 5.5    P 9.9  R 56.0 A 5.5    P 9.9  R 55.5 A 5.5
Jaro                  P 11.0 R 63.5 A 7.0    P 11.9 R 63.0 A 7.5    P 9.8  R 56.0 A 5.5
Jaro-Winkler          P 11.0 R 63.5 A 7.0    P 11.9 R 63.0 A 7.5    P 9.8  R 56.0 A 5.5
Needleman-Wunsch      P 43.7 R 87.0 A 38.0   P 43.7 R 87.0 A 38.0   P 16.9 R 59.0 A 10.0
LogSim                P 43.7 R 87.0 A 38.0   P 43.7 R 87.0 A 38.0   P 42.0 R 87.0 A 36.5
Soundex               P 35.9 R 83.5 A 30.0   P 35.9 R 83.5 A 30.0   P 28.7 R 82.0 A 23.5
Smith-Waterman        P 17.3 R 63.5 A 11.0   P 13.4 R 59.5 A 8.0    P 10.4 R 57.5 A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is in a way a progression from metrics such as the Overlap distance and LogSim, also using a tokenizer to break the strings into sets of tokens, and then taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, while also adding an extra weight for strings that not only share at least a perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
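One possible reading of equations (20)–(22) is sketched below. The way first-letter matches and non-common terms are counted here (both tokens of a first-letter pair are counted as matches, and non-common terms are counted across both answers) is our interpretation of the text, not necessarily the original implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of the 'JFKDistance' of equations (20)-(22); counting details are assumptions. */
public class JFKDistanceSketch {

    static double distance(String a1, String a2) {
        List<String> t1 = new ArrayList<>(Arrays.asList(a1.toLowerCase().split("\\s+")));
        List<String> t2 = new ArrayList<>(Arrays.asList(a2.toLowerCase().split("\\s+")));
        int maxLen = Math.max(t1.size(), t2.size());

        // perfect token matches ('kennedy' in the example above)
        int commonTerms = 0;
        for (var it = t1.iterator(); it.hasNext(); ) {
            if (t2.remove(it.next())) { it.remove(); commonTerms++; }
        }

        // tokens left unmatched on both sides ('j.', 'f.', 'john', 'fitzgerald')
        List<String> nonCommon = new ArrayList<>(t1);
        nonCommon.addAll(t2);
        int firstLetterMatches = 0;
        for (String x : t1)
            for (String y : t2)
                if (x.charAt(0) == y.charAt(0)) { firstLetterMatches += 2; break; } // count both tokens

        double commonTermsScore = (double) commonTerms / maxLen;
        double firstLetterScore = nonCommon.isEmpty() ? 0
                : ((double) firstLetterMatches / nonCommon.size()) / 2;
        return 1 - commonTermsScore - firstLetterScore;
    }

    public static void main(String[] args) {
        // roughly 0.17 under this reading, low enough to cluster the two answers
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}
```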

A few flaws in this measure were identified later on, namely its scope within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows us the results after the implementation of this new distance metric.

                     200 questions               1210 questions
Before               P 51.1   R 87.0   A 44.5    P 28.2   R 79.8   A 22.5
After new metric     P 51.7   R 87.0   A 45.0    P 29.2   R 79.8   A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a combination of metrics) for a certain category (or a group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis we are looking for too much), Table 10 shows the results for every metric over the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).

Metric / Category     Human    Location   Numeric
Levenshtein           50.00    53.85      4.84
Overlap               42.42    41.03      4.84
Jaccard index          7.58     0.00      0.00
Dice                   7.58     0.00      0.00
Jaro                   9.09     0.00      0.00
Jaro-Winkler           9.09     0.00      0.00
Needleman-Wunsch      42.42    46.15      4.84
LogSim                42.42    46.15      4.84
Soundex               42.42    46.15      3.23
Smith-Waterman         9.09     2.56      0.00
JFKDistance           54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except for when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking the best metric for each category, since we find out that the newly implemented distance equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are easily failable, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.

– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to answer 'Dillinger', but chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as this, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Languages 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No. 4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – a multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the j-th token of candidate answer Ai
• Ai,j,k – the k-th char of the j-th token of candidate answer Ai
• lemma(X) – the lemma of X
• ishypernym(X, Y) – returns true if X is the hypernym of Y
• length(X) – the number of tokens of X
• contains(X, Y) – returns true if X contains Y
• containsic(X, Y) – returns true if X contains Y, case insensitive
• equals(X, Y) – returns true if X is lexicographically equal to Y
• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive
• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 1)
  – Example: 'M01 Y2001' includes 'D01 M01 Y2001'

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 2)
  – Example: 'Y2001' includes 'D01 M01 Y2001'



1.2 Questions of category Numeric

Given

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: '200 people' is equivalent to '200'

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
  – Example: '210 people' is equivalent to '200 people'

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
  – Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  – Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: 'approximately 200 people' includes '210'


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  – Example: 'more than 200' includes '300'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  – Example: 'less than 300' includes '200'
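As an illustration, two of the Numeric rules above could be implemented roughly as follows. The regular expressions are simplified versions of the patterns defined at the beginning of this section, and the class and method names are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of two of the Numeric rules above (regular expressions are simplified). */
public class NumericRulesSketch {

    static final Pattern NUMBER    = Pattern.compile("(\\d+(\\.\\d+)?)");
    static final Pattern MORE_THAN = Pattern.compile("(?i)(more than|over|at least)\\s+(\\d+(\\.\\d+)?)");
    static final double MAX = 1.1, MIN = 0.9;

    /** '210 people' is equivalent to '200 people': the values differ by less than 10%. */
    static boolean equivalent(String ai, String aj) {
        Matcher mi = NUMBER.matcher(ai), mj = NUMBER.matcher(aj);
        if (!mi.find() || !mj.find()) return false;
        double ni = Double.parseDouble(mi.group(1)), nj = Double.parseDouble(mj.group(1));
        return ni * MAX > nj && ni * MIN < nj;
    }

    /** 'more than 200' includes '300': the second value lies inside the open interval. */
    static boolean includes(String ai, String aj) {
        Matcher mi = MORE_THAN.matcher(ai), mj = NUMBER.matcher(aj);
        return mi.find() && mj.find()
               && Double.parseDouble(mi.group(2)) < Double.parseDouble(mj.group(1));
    }

    public static void main(String[] args) {
        System.out.println(equivalent("210 people", "200 people")); // true
        System.out.println(includes("more than 200", "300"));       // true
    }
}
```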

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: 'Mr Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: 'John Fitzgerald Kennedy' is equivalent to 'John F Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: 'John F Kennedy' is equivalent to 'F Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: 'Mr Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)

• levenshteinDistance(Ai, Aj) < 1

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: 'Republic of China' is equivalent to 'China'


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: 'Florida' includes 'Miami, Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: 'California' includes 'Los Angeles, California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  – Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: 'animal' includes 'monkey'

• Introduction
• Related Work
  • Relating Answers
  • Paraphrase Identification
    • SimFinder – A methodology based on linguistic analysis of comparable corpora
    • Methodologies purely based on lexical similarity
    • Canonicalization as complement to lexical features
    • Use of context to extract similar phrases
    • Use of dissimilarities and semantics to provide better similarity measures
  • Overview
• JustAsk architecture and work points
  • JustAsk
    • Question Interpretation
    • Passage Retrieval
    • Answer Extraction
  • Experimental Setup
    • Evaluation measures
    • Multisix corpus
    • TREC corpus
  • Baseline
    • Results with Multisix Corpus
    • Results with TREC Corpus
  • Error Analysis over the Answer Module
• Relating Answers
  • Improving the Answer Extraction module: Relations Module Implementation
  • Additional Features
    • Normalizers
    • Proximity Measure
  • Results
  • Results Analysis
  • Building an Evaluation Corpus
• Clustering Answers
  • Testing Different Similarity Metrics
    • Clusterer Metrics
    • Results
    • Results Analysis
  • New Distance Metric – JFKDistance
    • Description
    • Results
    • Results Analysis
  • Possible new features – Combination of distance metrics to enhance overall results
  • Posterior analysis
• Conclusions and Future Work

...a canonical representation. For example, the dates 'March 21, 1961' and '21 March 1961' will be transformed into 'D21 M3 Y1961'.

Similar instances are then grouped into a set of clusters, and for that purpose distance metrics are used, namely the overlap distance and the Levenshtein [19] distance. The overlap distance is defined as:

Overlap(X, Y) = 1 − |X ∩ Y| / min(|X|, |Y|)    (9)

where |X ∩ Y| is the number of tokens shared by candidate answers X and Y, and min(|X|, |Y|) the size of the smallest candidate answer being compared. This formula returns 0.0 for candidate answers that are either equal or contained in the other. As mentioned in Section 3.1.3, the Levenshtein distance computes the minimum number of operations needed to transform one string into another. These two distance metrics are used with a standard single-link clustering algorithm, in which the distance between two clusters is considered to be the minimum of the distances between any members of the clusters. Initially, every candidate answer has a cluster of its own, and then, at each step, the two closest clusters are merged if they are under a certain threshold distance. The algorithm halts when the distances of the remaining clusters are higher than the threshold. Each cluster has a representative – the cluster's longest answer – and a score is assigned to the cluster, which is simply the sum of the scores of each candidate answer. With the exception of candidate answers extracted using regular expressions, the score of a candidate answer is always 1.0, meaning the cluster's score will in many cases be directly influenced by its length.

3.2 Experimental Setup

Over this section we will describe the corpora we used to evaluate our system.

3.2.1 Evaluation measures

To evaluate the results of JustAsk, the following measures, suited to evaluate QA systems [43], are used:

Precision:

    Precision = Correctly judged items / Relevant (non-null) items retrieved    (10)

Recall:

    Recall = Relevant (non-null) items retrieved / Items in test corpus    (11)

F-measure (F1):

    F = 2 × (precision × recall) / (precision + recall)    (12)

Accuracy:

    Accuracy = Correctly judged items / Items in test corpus    (13)


Mean Reciprocal Rank (MRR): used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

    MRR = (1/N) × Σ_{i=1..N} (1/rank_i)    (14)

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then seeing if the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied with time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old is a person that has already died). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages that we found at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64×200 entries), and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take for instance this entry:

1. Question=What is the capital of Somalia?
   URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia
   Content=Mogadishu is the largest city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) references. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In this version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process, while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially being used to construct new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.


3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions, from which we extracted the questions we knew the New York Times had a correct answer to; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources according to each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly and what the 'correct' answer it gives is. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step) and therefore lose precious entity information, by using pronouns for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference – we simply do not know, for now, who 'he' or 'she' is. It makes us ask: how much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?

The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historical context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced it to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, in several variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
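A sketch of this paragraph-level indexing is shown below, assuming a recent Lucene release. The field name, the paragraph-splitting heuristic and the directory layout are illustrative assumptions, since the thesis does not show the actual indexing code.

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

/** Sketch of paragraph-level indexing with Lucene: one Lucene document per paragraph. */
public class ParagraphIndexerSketch {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("nyt-index")),
                                                  new IndexWriterConfig(new StandardAnalyzer()));
             DirectoryStream<Path> articles = Files.newDirectoryStream(Paths.get(args[0]))) {
            for (Path article : articles) {
                // split each article on blank lines and index every paragraph separately
                for (String paragraph : Files.readString(article).split("\\n\\s*\\n")) {
                    Document doc = new Document();
                    doc.add(new TextField("contents", paragraph, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
        }
    }
}
```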

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, for the 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44.0       50.3        87.5     0.37   35.4

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to compute the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present the distribution of the number of questions by amount of extracted answers.

                     Google – 8 passages               Google – 32 passages
Extracted answers    Questions (#)  CAE      AS        Questions (#)  CAE      AS
                                    success  success                  success  success
0                    21             0        –         19             0        –
1 to 10              73             62       53        16             11       11
11 to 20             32             29       22        13             9        7
21 to 40             11             10       8         61             57       39
41 to 60             1              1        1         30             25       17
61 to 100            0              –        –         18             14       11
> 100                0              –        –         5              4        3
All                  138            102      84        162            120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no matter how many passages are used.


                     Google – 64 passages
Extracted answers    Questions (#)   CAE success   AS success
0                    17              0             0
1 to 10              18              9             9
11 to 20             2               2             2
21 to 40             16              12            8
41 to 60             42              38            25
61 to 100            38              32            18
> 100                35              30            17
All                  168             123           79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

        Google – 8 passages     Google – 32 passages     Google – 64 passages
      Total    Q   CAE   AS       Q   CAE   AS             Q   CAE   AS
abb       4    3     1    1       3     1    1             4     1    1
des       9    6     4    4       6     4    4             6     4    4
ent      21   15     1    1      16     1    0            17     1    0
hum      64   46    42   34      55    48   36            56    48   31
loc      39   27    24   19      34    27   22            35    29   21
num      63   41    30   25      48    35   25            50    40   22
        200  138   102   84     162   120   88           168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and


was therefore used permanently while searching for even better results. These 1210 questions proved at least to have higher accuracy and recall

scores than the previous sets of questions, as we can see in Table 5 below.

           1134 questions  232 questions  1210 questions
Precision  26.7            28.2           27.4
Recall     77.2            67.2           76.9
Accuracy   20.6            19.0           21.1

Table 5: Baseline results for the three question sets that were created using TREC questions and the New York Times corpus.

Later, we tried to make further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages   20    30    50    100
Accuracy   21.1  22.4  23.7  23.8

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, it also takes more than three times longer than running with just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still presents over its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted, '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles'


– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls - Okavango River - Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at a token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? LogSim distance takes into account the number of links, i.e. common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e. one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issue – many entities are referred to by completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: Gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such


as Soundex can solve more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviations normalization to include this kind of abbreviation.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also hyponymy and antonymy that we can access through WordNet. Moreover, we may want to


analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have made this process automatic and have extended the answering module, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than relations of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers and, from then on, use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to, plus a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency – i.e. the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes and the other for the answer that is included by the other).

For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
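A minimal sketch of how this scoring layer can be organized is shown below. The class and method names are illustrative only (they are not the actual JustAsk code), and the equivalence/inclusion tests stand in for the category-specific rules listed in Appendix A:

import java.util.List;
import java.util.Map;

// Hypothetical sketch of the scoring interfaces described above.
interface Scorer {
    // Adds this scorer's contribution to each candidate answer's score.
    void score(List<Candidate> candidates, Map<Candidate, Double> scores);
}

class Candidate {
    final String text;
    Candidate(String text) { this.text = text; }
}

// Baseline Scorer: counts exact repetitions of the same answer string.
class BaselineScorer implements Scorer {
    public void score(List<Candidate> candidates, Map<Candidate, Double> scores) {
        for (Candidate a : candidates)
            for (Candidate b : candidates)
                if (a != b && a.text.equalsIgnoreCase(b.text))
                    scores.merge(a, 1.0, Double::sum);   // frequency weight = 1.0
    }
}

// Rules Scorer: boosts candidates sharing equivalence (1.0) or inclusion (0.5) relations.
class RulesScorer implements Scorer {
    public void score(List<Candidate> candidates, Map<Candidate, Double> scores) {
        for (Candidate a : candidates)
            for (Candidate b : candidates) {
                if (a == b) continue;
                if (equivalent(a.text, b.text)) scores.merge(a, 1.0, Double::sum);
                else if (includes(a.text, b.text)) scores.merge(a, 0.5, Double::sum);
            }
    }
    // Stand-ins for the category-specific rules of Appendix A.
    private boolean equivalent(String a, String b) { return a.equalsIgnoreCase(b); }
    private boolean includes(String a, String b) { return b.toLowerCase().contains(a.toLowerCase()); }
}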


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
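As an illustration, a conversion normalizer along these lines could look like the sketch below. The class name, regular expressions and the choice of miles/pounds as canonical units are our own assumptions; only the conversion factors are the standard constants:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MeasureNormalizer {

    private static final Pattern KM = Pattern.compile("([0-9]+(?:\\.[0-9]+)?)\\s*(?:km|kilomet(?:er|re)s?)");
    private static final Pattern KG = Pattern.compile("([0-9]+(?:\\.[0-9]+)?)\\s*(?:kg|kilograms?)");

    // Converts distances and weights to a single canonical unit (miles, pounds)
    // so that answers such as '300 km' and '186.4 miles' can fall into the same cluster.
    public static String normalize(String answer) {
        String s = answer.toLowerCase();
        Matcher m = KM.matcher(s);
        if (m.find()) {
            double miles = Double.parseDouble(m.group(1)) * 0.621371;
            return String.format("%.1f miles", miles);
        }
        m = KG.matcher(s);
        if (m.find()) {
            double pounds = Double.parseDouble(m.group(1)) * 2.20462;
            return String.format("%.1f pounds", pounds);
        }
        return answer; // leave everything else untouched
    }

    public static void main(String[] args) {
        System.out.println(normalize("about 300 km"));   // -> 186.4 miles
        System.out.println(normalize("70 kg"));          // -> 154.3 pounds
    }
}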

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a telling reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, in order to treat 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, leaving room for further extensions.
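The sketch below illustrates, under the same caveats, how the abbreviation, case and acronym normalizations described in this section could be combined; the word lists are deliberately short examples, not the ones actually used by the system:

import java.util.LinkedHashMap;
import java.util.Map;

public class AnswerNormalizer {

    private static final String[] TITLES = { "Mr.", "Mrs.", "Ms.", "Dr.", "Sir", "Dame", "Queen", "King" };

    private static final Map<String, String> ACRONYMS = new LinkedHashMap<>();
    static {
        ACRONYMS.put("US", "United States");
        ACRONYMS.put("U.S.", "United States");
        ACRONYMS.put("UN", "United Nations");
    }

    public static String normalize(String answer) {
        String s = answer.trim();
        // Expand known acronyms (only for location/group categories in the real system).
        String expanded = ACRONYMS.get(s);
        if (expanded != null) return expanded;
        // Strip titles, but only at the beginning of the answer ('Queen Elizabeth II'),
        // so that answers like 'Mr. King' keep their surname.
        for (String t : TITLES)
            if (s.startsWith(t + " ")) { s = s.substring(t.length() + 1); break; }
        // Case normalization: 'GROZNY' and 'Grozny' become the same key.
        return s.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Queen Elizabeth II")); // -> elizabeth ii
        System.out.println(normalize("GROZNY"));             // -> grozny
        System.out.println(normalize("U.S."));               // -> United States
    }
}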

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
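A possible rendering of this proximity boost is sketched below, with the 20-character threshold and 5-point bonus mentioned above; locating keyword and answer positions with a simple indexOf is a simplification of ours:

import java.util.List;

public class ProximityBooster {

    static final int DISTANCE_THRESHOLD = 20; // characters
    static final double BONUS = 5.0;

    // Returns the (possibly boosted) score of an answer found in a passage.
    public static double boost(String passage, String answer, List<String> keywords, double score) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0 || keywords.isEmpty()) return score;
        double total = 0;
        int found = 0;
        for (String k : keywords) {
            int kPos = passage.indexOf(k);
            if (kPos >= 0) { total += Math.abs(kPos - answerPos); found++; }
        }
        if (found > 0 && (total / found) <= DISTANCE_THRESHOLD) return score + BONUS;
        return score;
    }

    public static void main(String[] args) {
        String passage = "Okavango River: 1000 miles long";
        // The answer sits close to both keywords, so the score is boosted to 15.0.
        System.out.println(boost(passage, "1000 miles", List.of("Okavango", "River"), 10.0));
    }
}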


4.3 Results

                  200 questions              1210 questions
Before features   P 50.6  R 87.0  A 44.0     P 28.0  R 79.5  A 22.2
After features    P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, especially considering the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear and, for those cases, a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;


– 'D' for cases of doubt between relation types.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
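A stripped-down sketch of this annotation loop is given below; file reading and writing are omitted, and the hard-coded answer pair is only there to make the example runnable:

import java.util.Scanner;

public class RelationAnnotator {

    public static void main(String[] args) {
        String[][] pairs = { { "Lee Myung-bak", "Lee" } };   // would come from the references file
        Scanner in = new Scanner(System.in);
        for (int q = 0; q < pairs.length; q++) {
            String judgment;
            while (true) {
                System.out.printf("Question %d: '%s' vs '%s' [E/I1/I2/A/C/D/skip]: %n",
                        q + 1, pairs[q][0], pairs[q][1]);
                judgment = in.nextLine().trim().toUpperCase();
                if (judgment.matches("E|I1|I2|A|C|D|SKIP")) break;  // repeat until a valid option
            }
            if (!judgment.equals("SKIP"))
                System.out.printf("QUESTION %d %s - %s AND %s%n",
                        q + 1, label(judgment), pairs[q][0], pairs[q][1]);
        }
    }

    static String label(String code) {
        switch (code) {
            case "E": return "EQUIVALENCE";
            case "I1": case "I2": return "INCLUSION";
            case "A": return "COMPLEMENTARY ANSWERS";
            case "C": return "CONTRADICTORY ANSWERS";
            default: return "IN DOUBT";
        }
    }
}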

Figure 6: Execution of the application to create relations' corpora in the command line.

Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of them) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics to potentially improve our system, as we also proposed doing. Therefore, we decided to run new experiments, bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics'.

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A,B) = \frac{|A \cap B|}{|A \cup B|}   (15)

In other words, the idea is to divide the number of characters in common by the sum of the total number of characters of the two sequences.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A,B) = \frac{2|A \cap B|}{|A| + |B|}   (16)

Dice's Coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams for each word are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly four elements: {Ke, en, ne, dy}. The result would be (2·4)/(6+6) = 8/12 = 2/3. If we were using Jaccard, we would have 4/12 = 1/3.
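The following self-contained sketch computes both coefficients over character bigrams, in the spirit of the example above (it is our own illustration, not the SimMetrics implementation used in JustAsk):

import java.util.HashSet;
import java.util.Set;

public class BigramSimilarity {

    // Set of character bigrams of a string.
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    static double dice(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        return 2.0 * common.size() / (x.size() + y.size());
    }

    static double jaccard(String a, String b) {
        Set<String> x = bigrams(a), y = bigrams(b);
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return (double) common.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(dice("night", "nacht"));    // bigram-based Dice
        System.out.println(jaccard("night", "nacht")); // bigram-based Jaccard
    }
}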


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)   (17)

in which s_1 and s_2 are the two sequences to be tested, |s_1| and |s_2| are their lengths, m is the number of matching characters in s_1 and s_2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

d_w = d_j + l \cdot p \cdot (1 - d_j)   (18)

with d_j being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 for Levenshtein):

D(i,j) = \min\begin{cases} D(i-1, j-1) + G \\ D(i-1, j) + G \\ D(i, j-1) + G \end{cases}   (19)
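A compact sketch of this recurrence is shown below; it assumes, as usual, that a match costs 0 on the diagonal, something the compact formula above leaves implicit:

public class NeedlemanWunsch {
    // Edit distance with a uniform gap/mismatch cost G; matching characters cost 0.
    static int distance(String a, String b, int G) {
        int[][] D = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) D[i][0] = i * G;
        for (int j = 0; j <= b.length(); j++) D[0][j] = j * G;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int diag = D[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : G);
                D[i][j] = Math.min(diag, Math.min(D[i - 1][j] + G, D[i][j - 1] + G));
            }
        return D[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // With G = 1 this behaves like the plain Levenshtein distance.
        System.out.println(distance("kennedy", "keneddy", 1));
    }
}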

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at the sequences in their entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.


Then, the LogSim metric by Cordeiro and his team, described in the Related Work section, was implemented and added to the package, using the same weighting penalty of 2.7 as in that team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much it affects the performance of each metric.
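To make the experimental setup concrete, here is a simplified sketch of single-link clustering with a pluggable metric and threshold (a greedy approximation for illustration only, not the actual JustAsk clusterer):

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class SingleLinkClusterer {

    public static List<List<String>> cluster(List<String> answers,
                                             BiFunction<String, String, Double> distance,
                                             double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String a : answers) {
            List<String> target = null;
            for (List<String> c : clusters) {
                // single link: one member under the threshold is enough to join the cluster
                for (String member : c) {
                    if (distance.apply(a, member) <= threshold) { target = c; break; }
                }
                if (target != null) break;
            }
            if (target == null) { target = new ArrayList<>(); clusters.add(target); }
            target.add(a);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Toy metric for the example; a normalized Levenshtein would be plugged in here.
        BiFunction<String, String, Double> metric =
                (x, y) -> x.equalsIgnoreCase(y) ? 0.0 : 1.0;
        // Prints [[Grozny, GROZNY], [Moscow]]
        System.out.println(cluster(List.of("Grozny", "GROZNY", "Moscow"), metric, 0.17));
    }
}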

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric \ Threshold    0.17                      0.33                      0.5
Levenshtein           P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap               P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index         P 9.8   R 56.0  A 5.5     P 9.9   R 55.5  A 5.5     P 9.9   R 55.5  A 5.5
Dice Coefficient      P 9.8   R 56.0  A 5.5     P 9.9   R 56.0  A 5.5     P 9.9   R 55.5  A 5.5
Jaro                  P 11.0  R 63.5  A 7.0     P 11.9  R 63.0  A 7.5     P 9.8   R 56.0  A 5.5
Jaro-Winkler          P 11.0  R 63.5  A 7.0     P 11.9  R 63.0  A 7.5     P 9.8   R 56.0  A 5.5
Needleman-Wunsch      P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim                P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex               P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman        P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A 8.0     P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in


order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim, also using a tokenizer to break the strings into sets of tokens, taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adding an extra weight for strings that not only share at least one perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two unmatched tokens. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))}   (21)

and

firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2}   (22)

with a_1 and a_2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
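A direct transcription of equations (20)-(22) into code could look like the sketch below; the way unmatched tokens (nonCommonTerms) and first-letter matches are counted is one possible reading of the definitions above, not necessarily the exact implementation:

public class JFKDistance {

    public static double distance(String a1, String a2) {
        String[] t1 = a1.trim().split("\\s+");
        String[] t2 = a2.trim().split("\\s+");

        int common = 0, firstLetterMatches = 0;
        boolean[] used = new boolean[t2.length];

        for (String x : t1) {
            boolean matched = false;
            // exact token match (case insensitive)
            for (int j = 0; j < t2.length; j++) {
                if (!used[j] && x.equalsIgnoreCase(t2[j])) { used[j] = true; common++; matched = true; break; }
            }
            // otherwise, try a first-letter match against a still unused token
            if (!matched) {
                for (int j = 0; j < t2.length; j++) {
                    if (!used[j] && Character.toLowerCase(x.charAt(0)) == Character.toLowerCase(t2[j].charAt(0))) {
                        used[j] = true; firstLetterMatches++; break;
                    }
                }
            }
        }
        int nonCommonTerms = (t1.length - common) + (t2.length - common);
        double commonTermsScore = (double) common / Math.max(t1.length, t2.length);          // eq. (21)
        double firstLetterMatchesScore = nonCommonTerms == 0 ? 0.0
                : ((double) firstLetterMatches / nonCommonTerms) / 2.0;                       // eq. (22)
        return 1.0 - commonTermsScore - firstLetterMatchesScore;                              // eq. (20)
    }

    public static void main(String[] args) {
        // A much lower distance than Levenshtein would give for this pair.
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}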

A few flaws in this measure were identified later on, namely the scope of this measure within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures according to the type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                  200 questions              1210 questions
Before            P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5
After new metric  P 51.7  R 87.0  A 45.0     P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect much the analysis we are


looking for), here are the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric \ Category   Human   Location   Numeric
Levenshtein         50.00   53.85      4.84
Overlap             42.42   41.03      4.84
Jaccard index       7.58    0.00       0.00
Dice                7.58    0.00       0.00
Jaro                9.09    0.00       0.00
Jaro-Winkler        9.09    0.00       0.00
Needleman-Wunsch    42.42   46.15      4.84
LogSim              42.42   46.15      4.84
Soundex             42.42   46.15      3.23
Smith-Waterman      9.09    2.56       0.00
JFKDistance         54.55   53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, i.e. results in each category are always 'proportional' to how they perform overall. That stops us from picking a best metric for each category, since the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected Google corpus (in this case, 2010).

– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as


a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After the insertion of these missing references, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

ndash We developed new metrics to improve the clustering of answers

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

ndash We performed further analysis to detect persistent error patterns

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias, and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989


[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006

[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.


[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence

[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006

[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995

[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006

[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970

[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1,261,167 (1918), 1,435,663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools


[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X,Y) – returns true if X is a hypernym of Y

• length(X) – the number of tokens of X

• contains(X,Y) – returns true if X contains Y

• containsic(X,Y) – returns true if X contains Y, case insensitive

• equals(X,Y) – returns true if X is lexicographically equal to Y

• equalsic(X,Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X,RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001



1.2 Questions of category Numeric

Given

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "()\bnumber( )") and matchesic(Aj, "()\bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bmorethan number( )") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\blessthan number( )") and (numAi > numAj)
  – Example: less than 300 includes 200

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to F Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey


Mean Reciprocal Rank (MRR) – used to evaluate systems that produce a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q in a test corpus. For a test collection of N queries, the mean reciprocal rank is given by:

MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}   (14)
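As a small worked illustration of equation (14):

public class MRR {
    // MRR over the ranks of the first correct answer for each question
    // (rank 0 meaning no correct answer was returned for that question).
    public static double mrr(int[] firstCorrectRanks) {
        double sum = 0;
        for (int rank : firstCorrectRanks)
            if (rank > 0) sum += 1.0 / rank;   // unanswered questions contribute 0
        return sum / firstCorrectRanks.length;
    }

    public static void main(String[] args) {
        // three questions: correct answer at rank 1, rank 3, and not found
        System.out.println(mrr(new int[] { 1, 3, 0 }));   // (1 + 1/3 + 0) / 3
    }
}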

3.2.2 Multisix corpus

To know for sure that a certain system works, we have to evaluate it. In the case of QA systems, and with JustAsk in particular, we evaluate it by giving it questions as input and then checking whether the answers the system outputs match the ones we found for ourselves.

We began by using the Multisix corpus [24]. It is a collection of 200 English questions covering many distinct domains, composed for the cross-language tasks at CLEF QA-2003. The corpus also has the correct answer for each question, searched in the Los Angeles Times over 1994. We call references the set of correct answers for each of the 200 questions. A manual validation of the corpus was made, adding to the references many variations of the answers found over the web sources (such as adding 'Imola, Italy', 'Imola' and 'Italy' to the question 'Where did Ayrton Senna have the accident that caused his death?'). In the context of this thesis, besides this manual validation, an application was developed to create the test corpus of references in a more efficient manner.

The 200 questions were also manually classified according to Li and Roth's question type taxonomy. A few answers were updated, since the question varied in time (one example is the question 'Who is the coach of Indiana?'), or even eliminated (10 of the 200, to be more accurate), since they no longer made sense (like asking who is the vice-president of a company that no longer exists, or how old a person that has already died is). And since the Multisix corpus lacked Description:Description and Human:Definition questions, 12 questions were gathered and used from the Multieight corpus [25].

For each of the 200 questions we have a set of references – possible correct answers – initially updated manually from the first 64 passages that we found at each of the two main search engines (Google and Yahoo) – which we call 'snippets'. This process is quite exhaustive in itself. The corpus file with all the results that the Google/Yahoo query produces is huge (a possible maximum of 64×200 entries) and reviewing it takes a considerable amount of time and patience. Moreover, we do not need all the information in it to add a reference. Take, for instance, this entry:

1. Question=What is the capital of Somalia?

URL=http://en.wikipedia.org/wiki/Mogadishu Title=Mogadishu - Wikipedia, the free encyclopedia Content=Mogadishu is the largest


city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) reference(s). The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with the list of questions and their respective references, and the corpus file containing a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word from the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we used before writing the application. In the current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,


while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially being used to construct new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line
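The core of the application illustrated in Figures 2 and 3 is a simple loop over questions and snippets that validates each typed reference against the snippet content before storing it. A minimal sketch of that loop is given below; the class and method names (ReferenceCollector, loadQuestions, loadSnippets, dump) are illustrative assumptions and not the actual ones used in JustAsk.

import java.util.*;

// Minimal sketch of the reference-collection loop (illustrative names only).
public class ReferenceCollector {

    public static void main(String[] args) throws Exception {
        List<String> questions = loadQuestions(args[0]);             // questions + existing references
        Map<Integer, List<String>> snippets = loadSnippets(args[1]); // up to 64 snippets per question
        List<List<String>> references = new ArrayList<>();           // the 'array of arrays'

        Scanner in = new Scanner(System.in);
        for (int q = 0; q < questions.size(); q++) {
            List<String> refs = new ArrayList<>();
            for (String content : snippets.getOrDefault(q, Collections.emptyList())) {
                System.out.println(content);                         // show only the 'Content' field
                String candidate = in.nextLine().trim();
                if (candidate.isEmpty() || candidate.equals("skip")) continue;
                if (content.contains(candidate)) {                   // accepted only if present in the snippet
                    refs.add(candidate);
                }
            }
            references.add(refs);
            dump(references);                                        // print current status to the output file
        }
    }

    // Stubs for brevity: real implementations would parse the two input files.
    static List<String> loadQuestions(String path) { return new ArrayList<>(); }
    static Map<Integer, List<String>> loadSnippets(String path) { return new HashMap<>(); }
    static void dump(List<List<String>> refs) { /* write the reference list to an output file */ }
}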


3.2.3 TREC corpus

In an effort to use a more extensive and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007; and in addition to the 200 test questions, we found a different set at TREC, which provided 1768 questions, from which we extracted the questions for which we knew the New York Times had a correct answer. 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly and what is the 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
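A small sketch of this filtering step is shown below. The judgement-file layout ("questionId docId judgement answer") and the helper logic are assumptions for illustration only, since the actual TREC file format is more involved; the one piece of real convention used is that TREC/AQUAINT document identifiers for the New York Times start with "NYT".

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Sketch: keep only the questions for which some correct judgement comes from an NYT document.
public class TrecFilter {
    public static void main(String[] args) throws IOException {
        Set<String> answerable = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get(args[1]))) {   // judgements file (assumed layout)
            String[] fields = line.split("\\s+", 4);
            if (fields.length >= 3 && fields[1].startsWith("NYT") && fields[2].equals("1"))
                answerable.add(fields[0]);                             // '1' marks a correct judgement
        }
        for (String line : Files.readAllLines(Paths.get(args[0]))) {   // questions file: "id question"
            String id = line.split("\\s+", 2)[0];
            if (answerable.contains(id)) System.out.println(line);     // keep the answerable questions
        }
    }
}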

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even lose their original referent after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' clearly remain unanswerable due to this lost reference. We simply do not know, for now, who 'he' or 'she' is. It makes us ask: "How much can we improve something that has no clear references, because the question itself references the previous one, and so forth?"

The first and simplest solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historical context (which Sundance festival? Is it the 21st, the 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics, and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus carried topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best possible manner, and by fixing a few mistakes in the application that filters the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced it to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/sentence, as we were with the Google corpus. Imagine the system retrieving, for the question 'What is the capital of Iraq?', a document that happens to be a political text mentioning the conflict between the country and the United States, and that ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraphs that contain the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
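The idea of paragraph-level indexing can be sketched as follows. JustAsk's actual indexing code is not reproduced here; this is only an illustration using a recent Lucene API (5+), so constructors and class names may differ from the Lucene version used at the time, and the document id and paragraph-splitting rule are assumptions.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;

// Sketch: index each paragraph of an article as its own Lucene document,
// so that passage retrieval returns paragraphs instead of whole articles.
public class ParagraphIndexer {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get("nyt-paragraph-index"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        String docId = "NYT-example-0001";                           // hypothetical article id
        String article = "First paragraph...\n\nSecond paragraph...";

        int n = 0;
        for (String paragraph : article.split("\\n\\s*\\n")) {       // split the article into paragraphs
            Document doc = new Document();
            doc.add(new StringField("docId", docId + "#" + (n++), Field.Store.YES));
            doc.add(new TextField("text", paragraph, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.close();
    }
}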

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions, JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.
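These figures are mutually consistent, assuming the evaluation measures defined earlier (Precision as correct answers over attempted questions, Recall as attempted questions over the total, and Accuracy as correct answers over the total):

\[
Precision = \frac{88}{175} \approx 50.3\%, \qquad
Recall = \frac{175}{200} = 87.5\%, \qquad
Accuracy = \frac{88}{200} = 44\%
\]

and the 70.4% figure is simply Precision@1 divided by Precision, i.e. $35.4 / 50.3 \approx 0.704$.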


Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision@1
44%        50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module, using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present the distribution of the number of questions with a certain amount of extracted answers.

Extracted         Google – 8 passages               Google – 32 passages
Answers           Questions   CAE        AS         Questions   CAE        AS
                  (#)         success    success    (#)         success    success
0                 21          0          –          19          0          –
1 to 10           73          62         53         16          11         11
11 to 20          32          29         22         13          9          7
21 to 40          11          10         8          61          57         39
41 to 60          1           1          1          30          25         17
61 to 100         0           –          –          18          14         11
> 100             0           –          –          5           4          3
All               138         102        84         162         120        88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no


Extracted         Google – 64 passages
Answers           Questions (#)   CAE success   AS success
0                 17              0             0
1 to 10           18              9             9
11 to 20          2               2             2
21 to 40          16              12            8
41 to 60          42              38            25
61 to 100         38              32            18
> 100             35              30            17
All               168             123           79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

             Google – 8 passages     Google – 32 passages    Google – 64 passages
     Total   Q     CAE   AS          Q     CAE   AS          Q     CAE   AS
abb  4       3     1     1           3     1     1           4     1     1
des  9       6     4     4           6     4     4           6     4     4
ent  21      15    1     1           16    1     0           17    1     0
hum  64      46    42    34          55    48    36          56    48    31
loc  39      27    24    19          34    27    22          35    29    21
num  63      41    30    25          48    35    25          50    40    22
All  200     138   102   84          162   120   88          168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions with extracted candidates, but at the same time gives worse results for the AS stage in three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and


was therefore used permanently while searching for even better results.

These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of questions created using TREC questions and the New York Times corpus.

Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

            20       30       50       100
Accuracy    21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, they also take more than thrice the time of running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted, '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers helps to assess their correctness:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

– 'Namibia has built a water canal measuring about 300 km long... Siyabona Africa Travel (Pty) Ltd "Popa Falls – Okavango River – Botswana" webpage...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, within a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs 'Elizabeth II'

'Mt Kilimanjaro' vs 'Mount Kilimanjaro'

'River Glomma' vs 'Glomma'

'Consejo Regulador de Utiel-Requena' vs 'Util-Requena' vs 'Requena'

'Calvin Broadus' vs 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at a token level, we might be able to cluster these answers. If there is at least one match between two tokens from candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be, in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.


3. Places/dates/numbers that differ in specificity – some candidates in the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs 'Mount Kenya National Park'

'Bavaria' vs 'Overbayern Upper Bavaria'

'2009' vs 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to be extracted would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the line for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities are referred to by completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs 'Khotashan'

'Grozny' and 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such


as Soundex can solve more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to include these acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus new metrics taken from the ideas collected in the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more effective, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy that we can access through WordNet. Moreover, we may want to


analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have made this process automatic and have extended the answering module, we can perform a study on how relationships between answers can affect the outcome of the answer and, therefore, the effectiveness of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can considerably improve our results. In this step, we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers and, from there, use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
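A minimal sketch of how this scoring scheme can be wired together is shown below. The names Scorer, BaselineScorer and RulesScorer follow Figure 5, but the Answer and Relation classes, the relation types and the exact weights are simplified assumptions rather than JustAsk's actual implementation.

import java.util.*;

// Simplified sketch of the relations-based scoring (names follow Figure 5; details are assumptions).
enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

class Relation {
    final RelationType type;
    final String relatedAnswer;
    Relation(RelationType type, String relatedAnswer) { this.type = type; this.relatedAnswer = relatedAnswer; }
}

class Answer {
    final String text;
    final List<Relation> relations = new ArrayList<>();
    double score;
    Answer(String text) { this.text = text; }
}

interface Scorer {
    void score(List<Answer> candidates);
}

// Counts exact repetitions of the same answer string (frequency).
class BaselineScorer implements Scorer {
    public void score(List<Answer> candidates) {
        for (Answer a : candidates)
            for (Answer b : candidates)
                if (a != b && a.text.equals(b.text)) a.score += 1.0;
    }
}

// Boosts answers according to the relations detected by the hand-written rules.
class RulesScorer implements Scorer {
    public void score(List<Answer> candidates) {
        for (Answer a : candidates)
            for (Relation r : a.relations)
                a.score += (r.type == RelationType.EQUIVALENCE) ? 1.0 : 0.5;   // the first experiment's weights
    }
}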


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, which is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
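A minimal sketch of such a conversion normalizer is shown below, covering only the four conversions mentioned above. The class name, the category labels and the single normalize(category, value) entry point are illustrative assumptions, not JustAsk's actual code.

// Illustrative sketch of a unit-conversion normalizer for Numeric answers.
public class UnitNormalizer {

    // Converts a metric value into the corresponding imperial unit, per answer category.
    public static double normalize(String category, double value) {
        switch (category) {
            case "NUMERIC:DISTANCE":    return value * 0.621371;        // kilometres -> miles
            case "NUMERIC:SIZE":        return value * 3.28084;         // meters -> feet
            case "NUMERIC:WEIGHT":      return value * 2.20462;         // kilograms -> pounds
            case "NUMERIC:TEMPERATURE": return value * 9.0 / 5.0 + 32;  // Celsius -> Fahrenheit
            default:                    return value;                   // leave other categories untouched
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("NUMERIC:DISTANCE", 300));  // about 186 miles
    }
}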

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, allowing room for further extensions.

4.2.2 Proximity Measure

As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
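A sketch of this proximity boost is given below. The 20-character threshold and the 5-point boost are the values stated above; the character-offset bookkeeping (first occurrence of each keyword) and the class and method names are simplified assumptions.

import java.util.List;

// Sketch of the proximity boost: answers whose average character distance to the
// query keywords (within the passage) is under 20 characters get +5 points.
public class ProximityBooster {

    public static double boost(String passage, String answer, List<String> keywords,
                               double currentScore) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0 || keywords.isEmpty()) return currentScore;

        double total = 0;
        int found = 0;
        for (String keyword : keywords) {
            int kwPos = passage.indexOf(keyword);
            if (kwPos >= 0) { total += Math.abs(answerPos - kwPos); found++; }
        }
        if (found == 0) return currentScore;

        double averageDistance = total / found;
        return averageDistance < 20 ? currentScore + 5 : currentScore;
    }
}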


4.3 Results

                    200 questions                         1210 questions
Before features     P: 50.6%   R: 87.0%   A: 44.0%        P: 28.0%   R: 79.5%   A: 22.2%
After features      P: 51.1%   R: 87.0%   A: 44.5%        P: 28.2%   R: 79.8%   A: 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that were put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;


– 'D' for cases of doubt between relation types.

If the input command fails to match any of these options, the script asks the user to repeat the judgement. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line

Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we had also proposed doing. Therefore, we decided to run new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined by the size of the intersection divided by the size of the union of two sets (A and B):

J(A,B) = \frac{|A \cap B|}{|A \cup B|}    (15)

In other words, the idea will be to divide the number of characters in common by the sum of the total number of characters of the two sequences.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A,B) = \frac{2|A \cap B|}{|A| + |B|}    (16)

Dice's Coefficient will basically give double the weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 x 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven elements (a short code sketch reproducing this computation appears after this list).


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s_1, s_2) = \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right)    (17)

in which s_1 and s_2 are the two sequences to be compared, |s_1| and |s_2| are their lengths, m is the number of matching characters in s_1 and s_2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by

d_w = d_j + l \cdot p \cdot (1 - d_j)    (18)

with d_j being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or misspellings such as 'Martha' vs 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 used by Levenshtein):

D(i,j) = \min \begin{cases} D(i-1, j-1) + G \\ D(i-1, j) + G \\ D(i, j-1) + G \end{cases}    (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.
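The bigram-based computation used in the Dice example above can be reproduced with a few lines of code. The sketch below is only meant to make that arithmetic concrete; it is an illustrative helper, not a class from the SimMetrics package.

import java.util.*;

// Bigram-based Jaccard and Dice between two strings (illustrative helper, not from SimMetrics).
public class BigramSimilarity {

    static Set<String> bigrams(String s) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    public static void main(String[] args) {
        Set<String> a = bigrams("Kennedy"), b = bigrams("Keneddy");
        Set<String> intersection = new LinkedHashSet<>(a);
        intersection.retainAll(b);                                  // {Ke, en, ne, ed, dy}
        Set<String> union = new LinkedHashSet<>(a);
        union.addAll(b);                                            // 7 distinct bigrams

        double dice = 2.0 * intersection.size() / (a.size() + b.size());
        double jaccard = (double) intersection.size() / union.size();
        System.out.println("Dice = " + dice + ", Jaccard = " + jaccard);
    }
}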


Then, the LogSim metric by Cordeiro and his team, described in the Related Work section, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much the threshold affects the performance of each metric.
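As an illustration of how a metric from this package plugs into a thresholded single-link clusterer, the sketch below groups candidate answer strings whenever some member of an existing cluster is within the distance threshold. The getSimilarity call (converted to a distance as 1 minus similarity) is the SimMetrics library's public API; the surrounding clustering code is a simplified stand-in for JustAsk's own clusterer, not its actual implementation.

import java.util.*;
import uk.ac.shef.wit.simmetrics.similaritymetrics.Levenshtein;

// Simplified single-link clustering of candidate answers (illustrative only).
public class SingleLinkClusterer {

    public static List<List<String>> cluster(List<String> answers, double threshold) {
        Levenshtein metric = new Levenshtein();               // any metric from the package works here
        List<List<String>> clusters = new ArrayList<>();
        for (String answer : answers) {
            List<String> target = null;
            for (List<String> cluster : clusters) {
                for (String member : cluster) {
                    double distance = 1.0 - metric.getSimilarity(answer, member);
                    if (distance <= threshold) { target = cluster; break; }  // single link: one close member suffices
                }
                if (target != null) break;
            }
            if (target == null) { target = new ArrayList<>(); clusters.add(target); }
            target.add(answer);
        }
        return clusters;
    }

    public static void main(String[] args) {
        System.out.println(cluster(Arrays.asList("Mogadishu", "Mogadisho", "Nairobi"), 0.17));
    }
}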

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric             Threshold 0.17            Threshold 0.33            Threshold 0.5
Levenshtein        P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap            P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index      P  9.8  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5    P  9.9  R 55.5  A  5.5
Dice Coefficient   P  9.8  R 56.0  A  5.5    P  9.9  R 56.0  A  5.5    P  9.9  R 55.5  A  5.5
Jaro               P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Jaro-Winkler       P 11.0  R 63.5  A  7.0    P 11.9  R 63.0  A  7.5    P  9.8  R 56.0  A  5.5
Needleman-Wunch    P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim             P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex            P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman     P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A  8.0    P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results, in %, for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system: Levenshtein with the 0.17 threshold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J.F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in


order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and adds an extra weight for strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore    (20)

in which the scores for the common terms and for the first-letter matches of the remaining tokens are given by

commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))}    (21)

and

firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2}    (22)

with a_1 and a_2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
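A possible implementation of this metric, following Equations (20)-(22), is sketched below. The treatment of nonCommonTerms is one plausible reading of the text (unmatched tokens are paired in order, and the count is normalised by the number of unmatched tokens of the longer answer); the original JustAsk code may differ in these details.

import java.util.*;

// Sketch of the 'JFKDistance' token-based metric (Equations 20-22); details hedged in comments.
public class JFKDistance {

    public static double distance(String a1, String a2) {
        List<String> t1 = tokenize(a1), t2 = tokenize(a2);

        // Common terms: tokens shared by both answers (case-insensitive exact match).
        List<String> rest1 = new ArrayList<>(t1), rest2 = new ArrayList<>(t2);
        int common = 0;
        for (Iterator<String> it = rest1.iterator(); it.hasNext(); ) {
            if (removeIgnoreCase(rest2, it.next())) { it.remove(); common++; }
        }
        double commonTermsScore = (double) common / Math.max(t1.size(), t2.size());   // Eq. (21)

        // First-letter matches among the remaining (non-common) tokens, paired in order.
        int firstLetterMatches = 0;
        int pairs = Math.min(rest1.size(), rest2.size());
        for (int i = 0; i < pairs; i++) {
            if (Character.toLowerCase(rest1.get(i).charAt(0)) ==
                Character.toLowerCase(rest2.get(i).charAt(0))) firstLetterMatches++;
        }
        int nonCommon = Math.max(rest1.size(), rest2.size());                         // assumed reading of Eq. (22)
        double firstLetterScore = nonCommon == 0 ? 0.0
                : ((double) firstLetterMatches / nonCommon) / 2.0;                    // Eq. (22)

        return 1.0 - commonTermsScore - firstLetterScore;                             // Eq. (20)
    }

    private static List<String> tokenize(String s) {
        return new ArrayList<>(Arrays.asList(s.trim().split("[\\s.]+")));
    }

    private static boolean removeIgnoreCase(List<String> list, String tok) {
        for (Iterator<String> it = list.iterator(); it.hasNext(); )
            if (it.next().equalsIgnoreCase(tok)) { it.remove(); return true; }
        return false;
    }

    public static void main(String[] args) {
        // 'J. F. Kennedy' vs 'John Fitzgerald Kennedy': one common term plus two first-letter matches,
        // giving a distance of roughly 0.17.
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}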

A few flaws of this measure were identified later on, namely the scope of this measure within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                    200 questions                         1210 questions
Before              P 51.1%   R 87.0%   A 44.5%           P 28.2%   R 79.8%   A 22.5%
After new metric    P 51.7%   R 87.0%   A 45.0%           P 29.2%   R 79.8%   A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a combination of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there were a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect much the analysis we are


looking for), here are the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric             Human     Location   Numeric
Levenshtein        50.00%    53.85%     4.84%
Overlap            42.42%    41.03%     4.84%
Jaccard index       7.58%     0.00%     0.00%
Dice                7.58%     0.00%     0.00%
Jaro                9.09%     0.00%     0.00%
Jaro-Winkler        9.09%     0.00%     0.00%
Needleman-Wunch    42.42%    46.15%     4.84%
LogSim             42.42%    46.15%     4.84%
Soundex            42.42%    46.15%     3.23%
Smith-Waterman      9.09%     2.56%     0.00%
JFKDistance        54.55%    53.85%     4.84%

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy values of the system in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how they perform overall. That stops us from picking the best metric for each category, since we find that the newly implemented distance equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected Google corpus (in this case, 2010).

– Numeric answers are error-prone due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as


a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) somehow excluding the publishing date from the snippet; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in formats different from the expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error lies in the reference file that presents all the possible 'correct' answers, potentially missing viable references.

After filling in gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help to somewhat improve these results.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN, pages 1870–4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.


[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.


[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.


[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is a hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001



1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "()\bnumber( )") and matchesic(Aj, "()\bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bmorethan number( )") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\blessthan number( )") and (numAi > numAj)
  – Example: less than 300 includes 200

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
– Example: 'John F. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
– Example: 'Mr. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
– Example: 'John Fitzgerald Kennedy' is equivalent to 'John F. Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
– Example: 'John F. Kennedy' is equivalent to 'F. Kennedy'

1.3.1 Questions of category Human Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
– Example: 'Mr. Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)

• levenshteinDistance(Ai, Aj) < 1

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
– Example: 'Republic of China' is equivalent to 'China'


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0) and matches(Ai, "in\s+Aj,0")
– Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
– Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
– Example: 'Florida' includes 'Miami, Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
– Example: 'California' includes 'Los Angeles, California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
– Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
– Example: 'animal' includes 'monkey'
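
To make the shape of these clauses concrete, the following is a minimal Java sketch of two of the Human equivalence rules above. The whitespace tokenizer and the method names are illustrative assumptions, not JustAsk's actual implementation.

    // Minimal sketch of two Human equivalence rules from Appendix A.
    public class HumanRules {

        // equalsic(X, Y): lexicographic equality, case insensitive
        static boolean equalsIc(String x, String y) {
            return x.equalsIgnoreCase(y);
        }

        // Rule: length(Ai) = 3, length(Aj) = 2, Ai,0 = Aj,0 and Ai,2 = Aj,1,
        // e.g. 'John F. Kennedy' is equivalent to 'John Kennedy'.
        // Rule: both answers have 3 tokens, first and last tokens match and the
        // middle tokens share the same initial, e.g. 'John Fitzgerald Kennedy'
        // is equivalent to 'John F. Kennedy'.
        static boolean equivalentHuman(String ai, String aj) {
            String[] a = ai.split("\\s+");
            String[] b = aj.split("\\s+");
            if (a.length == 3 && b.length == 2) {
                return equalsIc(a[0], b[0]) && equalsIc(a[2], b[1]);
            }
            if (a.length == 3 && b.length == 3) {
                return equalsIc(a[0], b[0]) && equalsIc(a[2], b[2])
                    && Character.toLowerCase(a[1].charAt(0)) == Character.toLowerCase(b[1].charAt(0));
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(equivalentHuman("John F. Kennedy", "John Kennedy"));            // true
            System.out.println(equivalentHuman("John Fitzgerald Kennedy", "John F. Kennedy")); // true
        }
    }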


... city in Somalia and the nation's capital. Located in the coastal Benadir region on the Indian Ocean, the city has served as an ... Rank=1 Structured=false

The attributes 'URL', 'Title', 'Rank' and 'Structured' are irrelevant for the mere process of adding a single (or multiple) references. The only attribute that really matters to us here is 'Content' – our snippet (or relevant passage).

An application was made in order to automate some of the process involved in adding references to a given question – creating the corpus containing every reference captured for all the questions. This application receives two files as arguments: one with a list of questions and their respective references, and the corpus file containing the corpus, with a stipulated maximum of 64 snippets per question. The script runs through the list of questions one by one and, for each of them, presents all the snippets one by one, removing 'unnecessary' content like 'URL' or 'Title', and requesting an input from the command line.

Figure 2: How the software to create the 'answers corpus' works.

This input is the reference one picks from a given snippet. So, if the content is 'Mogadishu is the largest capital of Somalia ...' and we insert 'Mogadishu', the system recognizes it as a true reference – since it is included in the content section – and adds it to the 'references list', which is nothing more than an array of arrays. Of course, this means any word present in the content is a viable reference, and therefore we can write any word in the content which has no relation whatsoever to the actual answer. But it helps one find and write down correct answers, instead of the completely manual approach we followed before writing the application. In this current version, the script prints the status of the reference list to an output file as it is being filled, question by question, up to the final snippet of each question. Figure 2 illustrates the process,


while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.
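
As a complement to Figures 2 and 3, here is a minimal Java sketch of the collection loop just described. The file names, the snippet layout ("question-index|content") and the output format are hypothetical; only the accept-the-input-if-it-occurs-in-the-snippet check mirrors the behaviour described above.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.Scanner;

    // Minimal sketch of the reference-collection loop.
    public class BuildAnswersCorpus {
        public static void main(String[] args) throws IOException {
            List<String> questions = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
            List<String> snippets  = Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8);
            Scanner in = new Scanner(System.in);
            try (PrintWriter out = new PrintWriter("references.txt", "UTF-8")) {
                for (int q = 0; q < questions.size(); q++) {
                    // assume each snippet line is prefixed with its question index, e.g. "12|<content>"
                    for (String line : snippets) {
                        if (!line.startsWith(q + "|")) continue;
                        String content = line.substring(line.indexOf('|') + 1);
                        System.out.println(questions.get(q));
                        System.out.println(content);
                        System.out.print("reference> ");
                        String ref = in.nextLine().trim();
                        // only accept the input if it actually occurs in the snippet
                        if (!ref.isEmpty() && content.contains(ref)) {
                            out.println(q + "\t" + ref);
                        }
                    }
                    out.flush(); // keep the partial references list on disk, question by question
                }
            }
        }
    }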


3.2.3 TREC corpus

In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to the Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions, from which we filtered out those for which we knew the New York Times had a correct answer; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also of a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly and what 'correct' answer it gives. A small application was created, using the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference. We simply do not know, for now, who 'he' or 'she' is. It makes us ask: "How much can we improve something that has no clear references, because the question itself is referencing the previous one, and so forth?"

The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested. A minimal sketch of such a filter is given below.
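
The sketch below assumes the questions are available one per line; the word-boundary regex is an assumption about how the filtering could be done, not a description of the actual script.

    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;

    // Minimal sketch of the pronoun-based question filter.
    public class PronounFilter {
        private static final Pattern PRONOUN =
            Pattern.compile("\\b(he|she|it|her|him|hers|his)\\b", Pattern.CASE_INSENSITIVE);

        public static boolean hasDanglingReference(String question) {
            return PRONOUN.matcher(question).find();
        }

        public static void main(String[] args) {
            List<String> questions = Arrays.asList(
                "When did she die?",
                "Where was the festival held?");
            questions.stream()
                     .filter(q -> !hasDanglingReference(q))
                     .forEach(System.out::println); // keeps only the second question
        }
    }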

But there were still much less obvious cases present. Here is a blatant example of how we still faced the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case, the original reference 'Sundance Festival' is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner possible, and by fixing a few mistakes in the application that filters the questions with a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced it to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers not appearing as often, and in as many variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages are not the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
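
The following is a minimal sketch of paragraph-level indexing with a recent Lucene release. The field names, the paragraph-splitting heuristic and the Lucene version are assumptions; the thesis only states that Lucene was used as the indexing tool.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Minimal sketch: index each paragraph as its own document instead of the whole article.
    public class ParagraphIndexer {
        public static void indexArticle(IndexWriter writer, String articleId, String articleText)
                throws Exception {
            for (String paragraph : articleText.split("\\n\\s*\\n")) {
                if (paragraph.trim().isEmpty()) continue;
                Document doc = new Document();
                doc.add(new StringField("article", articleId, Field.Store.YES));
                doc.add(new TextField("content", paragraph, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("nyt-paragraph-index"));
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                indexArticle(writer, "nyt-1987-0001",
                    "Paragraph one of the article.\n\nParagraph two of the article.");
            }
        }
    }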

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175, 50.3% of the time correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time (35.4/50.3 ≈ 0.704).


 Questions   Correct   Wrong   No Answer
 200         88        87      25

 Accuracy   Precision   Recall   MRR    Precision@1
 44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of passages retrieved from Google (8, 32 or 64), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions by the amount of extracted answers.

                      Google – 8 passages              Google – 32 passages
 Extracted Answers    Questions   CAE       AS         Questions   CAE       AS
                      (#)         success   success    (#)         success   success
 0                    21          0         –          19          0         –
 1 to 10              73          62        53         16          11        11
 11 to 20             32          29        22         13          9         7
 21 to 40             11          10        8          61          57        39
 41 to 60             1           1         1          30          25        17
 61 to 100            0           –         –          18          14        11
 > 100                0           –         –          5           4         3
 All                  138         102       84         162         120       88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no


                      Google – 64 passages
 Extracted Answers    Questions (#)   CAE success   AS success
 0                    17              0             0
 1 to 10              18              9             9
 11 to 20             2               2             2
 21 to 40             16              12            8
 41 to 60             42              38            25
 61 to 100            38              32            18
 > 100                35              30            17
 All                  168             123           79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the Google search engine.

               Google – 8 passages    Google – 32 passages    Google – 64 passages
       Total   Q     CAE   AS         Q     CAE   AS          Q     CAE   AS
 abb   4       3     1     1          3     1     1           4     1     1
 des   9       6     4     4          6     4     4           6     4     4
 ent   21      15    1     1          16    1     0           17    1     0
 hum   64      46    42    34         55    48    36          56    48    31
 loc   39      27    24    19         34    27    22          35    29    21
 num   63      41    30    25         48    35    25          50    40    22
       200     138   102   84         162   120   88          168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

matter how many passages are used. We found out that this category achieves its best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric), compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and


was therefore used permanently while searching for even better results. These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

              1134 questions   232 questions   1210 questions
 Precision    26.7%            28.2%           27.4%
 Recall       77.2%            67.2%           76.9%
 Accuracy     20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpora that were created using TREC questions and the New York Times corpus.

Later, we tried further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

              20       30       50       100
 Accuracy     21.1%    22.4%    23.7%    23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, they also take more than three times as long as running just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1,000 miles' and '300 km', the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals the following:

– 'The Okavango River rises in the Angolan highlands and flows over 1,000 miles ...'


– 'Namibia has built a water canal measuring about 300 km long ... Siyabona Africa Travel (Pty) Ltd, "Popa Falls - Okavango River - Botswana" webpage ...'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs 'Elizabeth II'

'Mt Kilimanjaro' vs 'Mount Kilimanjaro'

'River Glomma' vs 'Glomma'

'Consejo Regulador de Utiel-Requena' vs 'Util-Requena' vs 'Requena'

'Calvin Broadus' vs 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than the single common link we have here ('Kennedy') out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case, we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers, but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'Yes'.


3. Places/dates/numbers that differ in specificity – some candidates in the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs 'Mount Kenya National Park'

'Bavaria' vs 'Overbayern Upper Bavaria'

'2009' vs 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now within JustAsk, hoping for similar improvements.

4. Language issues – many entities go by completely different names: we can find the name in the original language of the entity's location or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs km vs feet, pounds vs kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs 'Khotashan'

'Grozny' and 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such


as Soundex can solve more specific cases, such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to cover this type of acronym.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that compose our 'relations module', new normalizers and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap), plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to


analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After making this process automatic and extending the answering module, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the purpose of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes, and another for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for any of the corpora.


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
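
A minimal sketch of this architecture is given below. The class and field names (Answer, Relation, Scorer) are illustrative and not JustAsk's actual code; the weights follow the first scheme mentioned above (1.0 for frequency and equivalence, 0.5 for inclusion).

    import java.util.List;

    interface Scorer {
        double score(Answer answer);
    }

    enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    class Relation {
        RelationType type;
        Answer other;
        Relation(RelationType type, Answer other) { this.type = type; this.other = other; }
    }

    class Answer {
        String text;
        int frequency;            // how many times the exact same string was extracted
        List<Relation> relations; // relations detected by the rules of Appendix A
        Answer(String text, int frequency, List<Relation> relations) {
            this.text = text; this.frequency = frequency; this.relations = relations;
        }
    }

    // Baseline Scorer: frequency only.
    class BaselineScorer implements Scorer {
        public double score(Answer a) { return a.frequency; }
    }

    // Rules Scorer: frequency plus 1.0 per equivalence and 0.5 per inclusion.
    class RulesScorer implements Scorer {
        public double score(Answer a) {
            double s = a.frequency;
            for (Relation r : a.relations) {
                s += (r.type == RelationType.EQUIVALENCE) ? 1.0 : 0.5;
            }
            return s;
        }
    }

    class Demo {
        public static void main(String[] args) {
            Answer kennedy = new Answer("John F. Kennedy", 2, new java.util.ArrayList<>());
            Answer jfk = new Answer("JFK", 3, new java.util.ArrayList<>());
            kennedy.relations.add(new Relation(RelationType.EQUIVALENCE, jfk));
            System.out.println(new RulesScorer().score(kennedy)); // 3.0
        }
    }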


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively. A sketch of such a converter is shown below.
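
This sketch assumes the canonical units and category labels used here; only the conversion factors themselves are standard.

    // Minimal sketch of the unit-conversion normalizer.
    public class UnitNormalizer {

        // normalize a Numeric:Distance answer to kilometres
        public static double distanceToKm(double value, String unit) {
            switch (unit.toLowerCase()) {
                case "mile": case "miles": return value * 1.609344;
                case "km": case "kilometre": case "kilometres":
                case "kilometer": case "kilometers": return value;
                default: throw new IllegalArgumentException("unknown distance unit: " + unit);
            }
        }

        // normalize a Numeric:Temperature answer to degrees Celsius
        public static double temperatureToCelsius(double value, String unit) {
            if (unit.equalsIgnoreCase("f") || unit.equalsIgnoreCase("fahrenheit")) {
                return (value - 32.0) * 5.0 / 9.0;
            }
            return value; // already Celsius
        }

        public static void main(String[] args) {
            // '1,000 miles' and '1609 km' now end up in comparable form
            System.out.println(distanceToKm(1000, "miles"));    // ~1609.3
            System.out.println(temperatureToCelsius(212, "F")); // 100.0
        }
    }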

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of the new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, so that 'Grozny' and 'GROZNY' are treated as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, leaving room for further extensions.

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that appear closer to the query keywords in the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points). A sketch of this boost follows.
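
In the sketch below the 20-character threshold and the 5-point boost follow the text, while the way keyword positions are located in the passage is an assumption.

    // Minimal sketch of the proximity boost.
    public class ProximityBoost {

        // average character distance between the answer occurrence and each query keyword
        public static double averageDistance(String passage, String answer, String[] keywords) {
            String lower = passage.toLowerCase();
            int answerPos = lower.indexOf(answer.toLowerCase());
            if (answerPos < 0) return Double.MAX_VALUE;
            double sum = 0;
            int found = 0;
            for (String kw : keywords) {
                int kwPos = lower.indexOf(kw.toLowerCase());
                if (kwPos >= 0) {
                    sum += Math.abs(answerPos - kwPos);
                    found++;
                }
            }
            return found == 0 ? Double.MAX_VALUE : sum / found;
        }

        public static double boostedScore(double score, String passage, String answer, String[] keywords) {
            return averageDistance(passage, answer, keywords) < 20 ? score + 5.0 : score;
        }

        public static void main(String[] args) {
            String passage = "The Okavango River rises in the Angolan highlands and flows over 1,000 miles.";
            String[] keywords = {"Okavango", "River"};
            // prints 1.0: the answer sits ~56 characters away from the keywords on average,
            // above the 20-character threshold, so no boost is applied
            System.out.println(boostedScore(1.0, passage, "1,000 miles", keywords));
        }
    }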


4.3 Results

                     200 questions                 1210 questions
 Before features     P=50.6%  R=87.0%  A=44.0%     P=28.0%  R=79.5%  A=22.2%
 After features      P=51.1%  R=87.0%  A=44.5%     P=28.2%  R=79.8%  A=22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in Sections 4.1 and 4.2, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1', if the first answer includes the second, and 'I2', if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;


– 'D' for cases of doubt about the relation type.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.

Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics with which to potentially improve our system, as we also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A, B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of characters in common by the sum of the total number of characters of the two sequences.

– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient basically gives double the weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The set of bigrams for each word is {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven distinct bigrams.
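
A minimal sketch of the bigram-based computation used in this example; the bigram extraction details (lower-casing, set semantics) are assumptions.

    import java.util.HashSet;
    import java.util.Set;

    // Minimal sketch of Dice's coefficient over character bigrams.
    public class DiceCoefficient {

        static Set<String> bigrams(String s) {
            Set<String> grams = new HashSet<>();
            for (int i = 0; i < s.length() - 1; i++) {
                grams.add(s.substring(i, i + 2).toLowerCase());
            }
            return grams;
        }

        public static double dice(String a, String b) {
            Set<String> ga = bigrams(a);
            Set<String> gb = bigrams(b);
            Set<String> common = new HashSet<>(ga);
            common.retainAll(gb);
            return (2.0 * common.size()) / (ga.size() + gb.size());
        }

        public static void main(String[] args) {
            // {ke,en,nn,ne,ed,dy} vs {ke,en,ne,ed,dd,dy}: 5 shared bigrams out of 6 + 6
            System.out.println(dice("Kennedy", "Keneddy")); // 0.8333...
        }
    }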


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + (l × p × (1 − dj))    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs 'Arnie', or to misspellings such as 'Martha' vs 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 of Levenshtein):

D(i, j) = min { D(i−1, j−1) + G, D(i−1, j) + G, D(i, j−1) + G }    (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a Levenshtein distance higher than the proposed threshold (such as 'Massachussets' vs 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better for answers with common affixes, such as 'acting' and 'enact'.


Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values of the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
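
The following is a minimal sketch of how metrics from that package can be plugged into the clustering threshold test. It assumes the library exposes an AbstractStringMetric base class with a getSimilarity(String, String) method and treats distance as 1 minus similarity; both are assumptions about the exact API rather than a description of JustAsk's code.

    import uk.ac.shef.wit.simmetrics.similaritymetrics.AbstractStringMetric;
    import uk.ac.shef.wit.simmetrics.similaritymetrics.JaroWinkler;
    import uk.ac.shef.wit.simmetrics.similaritymetrics.Levenshtein;
    import uk.ac.shef.wit.simmetrics.similaritymetrics.NeedlemanWunch;

    // Minimal sketch: answers whose distance (1 - similarity) is below the
    // threshold would end up in the same cluster during single-link clustering.
    public class MetricPlayground {

        static boolean sameCluster(AbstractStringMetric metric, String a1, String a2, double threshold) {
            double distance = 1.0 - metric.getSimilarity(a1, a2);
            return distance < threshold;
        }

        public static void main(String[] args) {
            String a1 = "Queen Elizabeth";
            String a2 = "Elizabeth II";
            AbstractStringMetric[] metrics = {
                new Levenshtein(), new JaroWinkler(), new NeedlemanWunch()
            };
            for (AbstractStringMetric m : metrics) {
                System.out.println(m.getClass().getSimpleName() + ": "
                    + sameCluster(m, a1, a2, 0.33));
            }
        }
    }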

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually inexistent in most cases.


 Metric \ Threshold     0.17                          0.33                          0.5
 Levenshtein            P=51.1%  R=87.0%  A=44.5%     P=43.7%  R=87.5%  A=38.5%     P=37.3%  R=84.5%  A=31.5%
 Overlap                P=37.9%  R=87.0%  A=33.0%     P=33.0%  R=87.0%  A=33.0%     P=36.4%  R=86.5%  A=31.5%
 Jaccard index          P=9.8%   R=56.0%  A=5.5%      P=9.9%   R=55.5%  A=5.5%      P=9.9%   R=55.5%  A=5.5%
 Dice Coefficient       P=9.8%   R=56.0%  A=5.5%      P=9.9%   R=56.0%  A=5.5%      P=9.9%   R=55.5%  A=5.5%
 Jaro                   P=11.0%  R=63.5%  A=7.0%      P=11.9%  R=63.0%  A=7.5%      P=9.8%   R=56.0%  A=5.5%
 Jaro-Winkler           P=11.0%  R=63.5%  A=7.0%      P=11.9%  R=63.0%  A=7.5%      P=9.8%   R=56.0%  A=5.5%
 Needleman-Wunsch       P=43.7%  R=87.0%  A=38.0%     P=43.7%  R=87.0%  A=38.0%     P=16.9%  R=59.0%  A=10.0%
 LogSim                 P=43.7%  R=87.0%  A=38.0%     P=43.7%  R=87.0%  A=38.0%     P=42.0%  R=87.0%  A=36.5%
 Soundex                P=35.9%  R=83.5%  A=30.0%     P=35.9%  R=83.5%  A=30.0%     P=28.7%  R=82.0%  A=23.5%
 Smith-Waterman         P=17.3%  R=63.5%  A=11.0%     P=13.4%  R=59.5%  A=8.0%      P=10.4%  R=57.5%  A=6.0%

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in


order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight when the two strings not only share at least one perfect token match but also have another pair of tokens sharing the same first letter. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens left unmatched in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.', and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and:

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.

A few flaws in this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see. A minimal sketch of the metric is given below.
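
The sketch follows equations (20)-(22); the tokenization and the way non-common terms are counted across the two answers are simplifying assumptions, not the exact JustAsk implementation.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Minimal sketch of the JFKDistance metric.
    public class JFKDistance {

        public static double distance(String answer1, String answer2) {
            String[] t1 = answer1.toLowerCase().split("\\s+");
            String[] t2 = answer2.toLowerCase().split("\\s+");

            // common terms: tokens appearing in both answers
            Set<String> set1 = new HashSet<>(Arrays.asList(t1));
            Set<String> set2 = new HashSet<>(Arrays.asList(t2));
            Set<String> common = new HashSet<>(set1);
            common.retainAll(set2);

            double commonTermsScore = (double) common.size() / Math.max(t1.length, t2.length);

            // non-common terms whose first letters match get half the weight of a full match
            List<String> rest1 = new ArrayList<>(set1); rest1.removeAll(common);
            List<String> rest2 = new ArrayList<>(set2); rest2.removeAll(common);
            int firstLetterMatches = 0;
            for (String a : rest1) {
                for (String b : rest2) {
                    if (a.charAt(0) == b.charAt(0)) { firstLetterMatches++; break; }
                }
            }
            int nonCommonTerms = rest1.size() + rest2.size();
            double firstLetterScore = nonCommonTerms == 0
                ? 0.0 : (firstLetterMatches / (double) nonCommonTerms) / 2.0;

            return 1.0 - commonTermsScore - firstLetterScore;
        }

        public static void main(String[] args) {
            // one common term ('kennedy') plus two initial matches: distance ~0.42
            System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
        }
    }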


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                      200 questions                 1210 questions
 Before               P=51.1%  R=87.0%  A=44.5%     P=28.2%  R=79.8%  A=22.5%
 After new metric     P=51.7%  R=87.0%  A=45.0%     P=29.2%  R=79.8%  A=23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or a group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort, we used the best threshold for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect much the analysis we are


looking for), here are the results for every metric incorporated, for the three main categories that can be easily evaluated – Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input) – shown in Table 10.

 Metric \ Category    Human     Location   Numeric
 Levenshtein          50.00%    53.85%     4.84%
 Overlap              42.42%    41.03%     4.84%
 Jaccard index        7.58%     0.00%      0.00%
 Dice                 7.58%     0.00%      0.00%
 Jaro                 9.09%     0.00%      0.00%
 Jaro-Winkler         9.09%     0.00%      0.00%
 Needleman-Wunsch     42.42%    46.15%     4.84%
 LogSim               42.42%    46.15%     4.84%
 Soundex              42.42%    46.15%     3.23%
 Smith-Waterman       9.09%     2.56%      0.00%
 JFKDistance          54.55%    53.85%     4.84%

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking a best metric for each category, since we find out that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are easily failed, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as


a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer to the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we list the multiple answers as separate, single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error is the reference file that lists all the possible 'correct' answers, which may be missing viable references.

After filling in gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state-of-the-art survey on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias, and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans, and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.


[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume (short papers), Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez, and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

54

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur, and A. Mendes. Just ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is the hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001


1.2 Questions of category Numeric

Given:

• number = (\d+([0-9.,E])*)

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "()\bnumber( )") and matchesic(Aj, "()\bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bmorethan number( )") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\blessthan number( )") and (numAi > numAj)
  – Example: less than 300 includes 200
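To make the tolerance clauses above concrete, here is a minimal Java sketch (illustrative code, not the actual JustAsk implementation) of the "approx" inclusion rule, using the max/min factors defined above and a simplified version of the patterns:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the 'approx' inclusion clause: "approximately 200 people"
// includes "210" when 200 falls inside the [210*min, 210*max] tolerance band.
public class NumericInclusionSketch {

    private static final double MAX = 1.1;   // tolerance upper factor (as above)
    private static final double MIN = 0.9;   // tolerance lower factor (as above)
    private static final Pattern APPROX_NUMBER = Pattern.compile(
            "\\b(?:approx[a-z]*|around|roughly|about|some)\\s+(\\d+(?:[.,]\\d+)?)",
            Pattern.CASE_INSENSITIVE);

    /** Returns true if candidate ai ("approximately 200 people") includes aj ("210"). */
    static boolean approxIncludes(String ai, String aj) {
        Matcher m = APPROX_NUMBER.matcher(ai);
        if (!m.find()) {
            return false;
        }
        double numAi = Double.parseDouble(m.group(1).replace(",", "."));
        double numAj;
        try {
            numAj = Double.parseDouble(aj.trim().replace(",", "."));
        } catch (NumberFormatException e) {
            return false;             // aj is not a bare number
        }
        // (numAjmax > numAi) and (numAjmin < numAi), as in the third rule above
        return numAj * MAX > numAi && numAj * MIN < numAi;
    }

    public static void main(String[] args) {
        System.out.println(approxIncludes("approximately 200 people", "210"));  // true
        System.out.println(approxIncludes("approximately 200 people", "300"));  // false
    }
}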

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to F Kennedy
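As an illustration, the first equivalence clause above could be checked with something like the following minimal Java sketch (hypothetical code, not the JustAsk source):

// Sketch of the first Human equivalence clause: a 3-token name and a 2-token
// name are equivalent when they share the first and last tokens.
public class HumanEquivalenceSketch {

    static boolean equivalentNames(String ai, String aj) {
        String[] a = ai.trim().split("\\s+");
        String[] b = aj.trim().split("\\s+");
        return a.length == 3 && b.length == 2
                && a[0].equalsIgnoreCase(b[0])     // equalsic(Ai,0, Aj,0)
                && a[2].equalsIgnoreCase(b[1]);    // equalsic(Ai,2, Aj,1)
    }

    public static void main(String[] args) {
        System.out.println(equivalentNames("John F Kennedy", "John Kennedy"));   // true
        System.out.println(equivalentNames("John F Kennedy", "Bill Clinton"));   // false
    }
}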

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)

• levenshteinDistance(Ai, Aj) < 1

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey


while Figure 3 presents a snapshot of the application running. This software also has the advantage of potentially allowing the construction of new corpora in the future.

Figure 3: Execution of the application to create answers corpora in the command line.


3.2.3 TREC corpus

In an effort to use a more extended and 'static' corpus, we decided to use additional corpora to test JustAsk. Therefore, in addition to Google and Yahoo passages, we now also have 20 years of New York Times editions, from 1987 to 2007, and in addition to the 200 test questions we found a different set at TREC, which provided 1768 questions. We filtered these to extract the questions for which we knew the New York Times had a correct answer; 1134 questions remained after this filtering stage.

TREC questions

The TREC data is composed of a set of questions and also a set of correct/incorrect judgements from different sources for each question (such as the New York Times online editions) – answers that TREC had judged as correct or incorrect. We therefore know whether a source manages to answer a question correctly and what the 'correct' answer it gives is. A small application was created, taking the file with all the 1768 questions and the file containing all the correct judgements as input parameters, in order to keep only the questions for which we know the New York Times managed to retrieve a correct answer.
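A minimal sketch of such a filtering step is shown below; the file names and the tab-separated formats are assumptions made only for illustration, since the exact TREC file layouts are not reproduced here.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: keep only the questions for which the judgement file
// records a correct New York Times answer. The assumed format is
// "questionId <TAB> source <TAB> correct" for judgements and
// "questionId <TAB> question text" for questions.
public class FilterTrecQuestions {
    public static void main(String[] args) throws IOException {
        Set<String> answeredByNyt = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get("judgements.tsv"))) {
            String[] fields = line.split("\t");
            if (fields.length >= 3 && fields[1].contains("NYT") && fields[2].equals("1")) {
                answeredByNyt.add(fields[0]);      // question id judged correct for NYT
            }
        }
        List<String> questions = Files.readAllLines(Paths.get("questions.tsv"));
        for (String question : questions) {
            String id = question.split("\t")[0];
            if (answeredByNyt.contains(id)) {
                System.out.println(question);      // question kept for the test set
            }
        }
    }
}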

After experimenting with the 1134 questions, we noticed that many of these questions are actually interconnected (some even losing an original reference after this filtering step), and therefore lose precious entity information by using pronouns, for example, which could explain in part the underwhelming results we were obtaining with them. Questions such as 'When did she die?' or 'Who was his father?' remain clearly unanswered due to this lost reference: we simply do not know, for now, who 'he' or 'she' is. It makes us ask how much we can improve something that has no clear references, because the question itself references the previous one, and so forth.

The first and most simple solution was to filter out the questions that contained unknown third-party references – pronouns such as 'he', 'she', 'it', 'her', 'him', 'hers' or 'his' – plus a couple of questions we easily detected as driving the system to clear and obvious dead ends, such as 'How many are there now?'. A total of 232 questions without these features were collected and then tested.

But there were still much less obvious cases present. Here is a blatant example of how we were still facing the same problem:

question 1129: Q: Where was the festival held?
question 1130: Q: What is the name of the artistic director of the festival?
question 1131: Q: When was the festival held?
question 1132: Q: Which actress appeared in two films shown at the festival?
question 1133: Q: Which film won the Dramatic Screen Play Award at the festival?
question 1134: Q: Which film won three awards at the festival?


In this particular case the original reference, 'Sundance Festival', is actually lost, along with vital historic context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors, such as 'the company' instead of simply 'it', in the question.

We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus carried topic information that many times was not explicit within the question itself, the best possible solution seemed to be to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and after noticing a few mistakes in the application used to filter the questions that had a correct answer from the New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced it to 1210 questions.

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind, we tried another approach, in which the indexed passages would not be the whole documents, but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
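The paragraph-level indexing can be sketched as follows, assuming a recent Lucene release (5.x or later); the field names and the splitting on blank lines are illustrative choices, not necessarily the ones used in JustAsk.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Sketch: index each paragraph of an article as its own Lucene document,
// instead of indexing the whole article as a single passage.
public class ParagraphIndexer {

    public static void indexArticle(IndexWriter writer, String articleId, String articleText)
            throws Exception {
        for (String paragraph : articleText.split("\\n\\s*\\n")) {
            if (paragraph.trim().isEmpty()) {
                continue;
            }
            Document doc = new Document();
            doc.add(new StringField("articleId", articleId, Field.Store.YES));
            doc.add(new TextField("contents", paragraph, Field.Store.YES));
            writer.addDocument(doc);
        }
    }

    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("nyt-index")), config)) {
            indexArticle(writer, "nyt-1987-0001", "First paragraph.\n\nSecond paragraph.");
        }
    }
}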

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision@1 indicates the precision for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision@1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.


Questions: 200    Correct: 88    Wrong: 87    No Answer: 25

Accuracy: 44.0%    Precision: 50.3%    Recall: 87.5%    MRR: 0.37    Precision@1: 35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance, with a threshold of 0.33, to calculate the distance between candidate answers. For the different amounts of retrieved passages from Google (8, 32 or 64 passages), we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions with a certain amount of extracted answers.

                     Google – 8 passages            Google – 32 passages
Extracted answers    Questions  CAE      AS         Questions  CAE      AS
                     (#)        success  success    (#)        success  success
0                    21         0        –          19         0        –
1 to 10              73         62       53         16         11       11
11 to 20             32         29       22         13         9        7
21 to 40             11         10       8          61         57       39
41 to 60             1          1        1          30         25       17
61 to 100            0          –        –          18         14       11
> 100                0          –        –          5          4        3
All                  138        102      84         162        120      88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having way too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no matter how many passages are used.


                     Google – 64 passages
Extracted answers    Questions (#)  CAE success  AS success
0                    17             0            0
1 to 10              18             9            9
11 to 20             2              2            2
21 to 40             16             12           8
41 to 60             42             38           25
61 to 100            38             32           18
> 100                35             30           17
All                  168            123          79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

        Google – 8 passages     Google – 32 passages    Google – 64 passages
     Total    Q   CAE   AS         Q   CAE   AS            Q   CAE   AS
abb      4    3     1    1         3     1    1            4     1    1
des      9    6     4    4         6     4    4            6     4    4
ent     21   15     1    1        16     1    0           17     1    0
hum     64   46    42   34        55    48   36           56    48   31
loc     39   27    24   19        34    27   22           35    29   21
num     63   41    30   25        48    35   25           50    40   22
All    200  138   102   84       162   120   88          168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and with 20 paragraphs retrieved from the New York Times, which represents an increase of approximately 32.2% (57 more correct answers than we had) over the previous approach of using the whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing, and was therefore used permanently while searching for even better results.


These 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three sets of corpus that were created using TREC questions and the New York Times corpus.

Later we made further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages    20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though the 100 passages offer the best results, this also takes more than three times as long as running with just 20 passages, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these is viable, with the 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (the Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, two answers of the Numeric type were extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 1.0. Further analysis of the passages containing the respective answers reveals which one is correct:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles.'


– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls — Okavango River — Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates of categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr Clinton', normalizing abbreviations such as 'Mr', 'Mrs', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then we check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be a 'yes'.


3. Places/dates/numbers that differ in specificity – some candidates of the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (3.0). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the line for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same entity: we can find the name in the original language of the entity's location, or in a normalized English form. Even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard form. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases, such as the one described above.


The question remains: when and where do we apply it, exactly?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to cover acronyms as well.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of them.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of a set of rules that will compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers for a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet.


Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After this process is automated and the answering module has been extended, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system, and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers, and from then on use that for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to – and a score for that relation. The module itself is composed of an interface (Scorer), which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes, and the other for the answer that is included by the other).

For a first experiment, we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for any of the corpora.
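A simplified sketch of this boosting step is given below; the class and member names are illustrative rather than the actual JustAsk types.

import java.util.List;
import java.util.Map;

// Sketch of how the Rules Scorer could boost candidates that share relations,
// using the weights of the first experiment described above.
class RelationScorerSketch {

    enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    static class Relation {
        RelationType type;
        Relation(RelationType type) { this.type = type; }
    }

    static class CandidateAnswer {
        double score;
        List<Relation> relations;
        CandidateAnswer(double score, List<Relation> relations) {
            this.score = score;
            this.relations = relations;
        }
    }

    // 1.0 for equivalence, 0.5 for each side of an inclusion.
    private static final Map<RelationType, Double> WEIGHTS = Map.of(
            RelationType.EQUIVALENCE, 1.0,
            RelationType.INCLUDES, 0.5,
            RelationType.INCLUDED_BY, 0.5);

    /** Adds, for every relation a candidate participates in, the weight of that relation. */
    static void boost(List<CandidateAnswer> candidates) {
        for (CandidateAnswer candidate : candidates) {
            for (Relation relation : candidate.relations) {
                candidate.score += WEIGHTS.get(relation.type);
            }
        }
    }
}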


Figure 5: The relations module's basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, which is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we found a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' – in this case, however, we can easily limit King removals only to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system's results, but it also only covers very particular cases for the moment, leaving room for further extensions.
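As an example, the distance conversion could look like the following minimal sketch (the conversion factor is standard; the class and method names are illustrative, not the actual JustAsk code):

// Sketch of a distance normalizer: bring "300 km" and "186 miles" to a common unit.
class DistanceNormalizerSketch {

    private static final double KM_PER_MILE = 1.609344;

    /** Normalizes a distance answer to its value in miles. */
    static double toMiles(String answer) {
        String[] parts = answer.trim().split("\\s+");
        double value = Double.parseDouble(parts[0].replace(",", ""));
        String unit = parts.length > 1 ? parts[1].toLowerCase() : "";
        if (unit.startsWith("km") || unit.startsWith("kilomet")) {
            return value / KM_PER_MILE;
        }
        return value;   // assume the value is already in miles
    }

    public static void main(String[] args) {
        System.out.printf("%.1f%n", toMiles("300 km"));       // ~186.4
        System.out.printf("%.1f%n", toMiles("1,000 miles"));  // 1000.0
    }
}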

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords, and if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
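A minimal sketch of this proximity boost, mirroring the description above rather than the actual implementation, could look like this:

import java.util.List;

// Sketch: if the average character distance between the candidate answer and the
// query keywords in the passage is below 20 characters, add 5 points to its score.
class ProximityBoostSketch {

    private static final int DISTANCE_THRESHOLD = 20;  // characters
    private static final double BOOST = 5.0;           // points added to the score

    static double boostedScore(double score, String passage, String answer, List<String> keywords) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0 || keywords.isEmpty()) {
            return score;
        }
        double totalDistance = 0;
        int found = 0;
        for (String keyword : keywords) {
            int keywordPos = passage.indexOf(keyword);
            if (keywordPos >= 0) {
                totalDistance += Math.abs(answerPos - keywordPos);
                found++;
            }
        }
        if (found > 0 && totalDistance / found < DISTANCE_THRESHOLD) {
            return score + BOOST;
        }
        return score;
    }
}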


4.3 Results

                   200 questions                     1210 questions
Before features    P: 50.6%  R: 87.0%  A: 44.0%      P: 28.0%  R: 79.5%  A: 22.2%
After features     P: 51.1%  R: 87.0%  A: 44.5%      P: 28.2%  R: 79.8%  A: 22.5%

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in that case a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input, and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;


– 'D' for cases of doubt between relation types.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgement. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADITORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
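A minimal sketch of the annotation loop (hypothetical code; the real application may differ in its prompts and file handling) is shown below:

import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Sketch of the annotation loop: for each pair of references the user types
// E, I1, I2, A, C, D or skip, and the judgement is echoed in the output format
// shown above. The question and references here are illustrative data.
public class RelationAnnotationSketch {

    private static final List<String> OPTIONS = Arrays.asList("E", "I1", "I2", "A", "C", "D", "SKIP");

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        String question = "Who is the president of South Korea?";
        String ref1 = "Lee Myung-bak";
        String ref2 = "Lee";
        String judgement;
        do {
            System.out.printf("QUESTION: %s%n  %s  vs.  %s  [E/I1/I2/A/C/D/skip]: ",
                    question, ref1, ref2);
            judgement = in.nextLine().trim().toUpperCase();
        } while (!OPTIONS.contains(judgement));        // repeat until a valid option is typed
        if (judgement.equals("E")) {
            System.out.println("QUESTION 5 EQUIVALENCE - " + ref1 + " AND " + ref2);
        }
    }
}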

Figure 6: Execution of the application to create relations' corpora in the command line.

Over the next chapter we will try to improve the results by working directly with the several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets (A and B):

J(A, B) = |A ∩ B| / |A ∪ B|   (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of distinct elements across both sequences.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)   (16)

Dice's coefficient basically gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly five elements: {Ke, en, ne, ed, dy}. The result would be (2·5)/(6+6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7, since the union of the two sets has seven elements. (A standalone sketch of this computation is given after this list.)


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 · (m/|s1| + m/|s2| + (m − t)/m)   (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)   (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable gap cost G (instead of the constant 1 for Levenshtein):

D(i, j) = min { D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }   (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.
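To make the bigram computation above concrete, here is a small standalone Java sketch of the Dice coefficient (in JustAsk the metrics themselves come from the SimMetrics package; this version only reproduces the worked 'Kennedy' vs. 'Keneddy' example):

import java.util.HashSet;
import java.util.Set;

// Dice coefficient over character bigrams: 2 * |common bigrams| / (|A| + |B|).
class DiceBigramSketch {

    private static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    static double dice(String a, String b) {
        Set<String> ga = bigrams(a);
        Set<String> gb = bigrams(b);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));   // 5/6 = 0.833...
    }
}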


Then the LogSim metric by Cordeiro and his team, described in the related work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much this affects the performance of each metric.

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between these two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually non-existent in most cases.


Metric / Threshold   0.17                          0.33                          0.5
Levenshtein          P: 51.1  R: 87.0  A: 44.5     P: 43.7  R: 87.5  A: 38.5     P: 37.3  R: 84.5  A: 31.5
Overlap              P: 37.9  R: 87.0  A: 33.0     P: 33.0  R: 87.0  A: 33.0     P: 36.4  R: 86.5  A: 31.5
Jaccard index        P: 9.8   R: 56.0  A: 5.5      P: 9.9   R: 55.5  A: 5.5      P: 9.9   R: 55.5  A: 5.5
Dice Coefficient     P: 9.8   R: 56.0  A: 5.5      P: 9.9   R: 56.0  A: 5.5      P: 9.9   R: 55.5  A: 5.5
Jaro                 P: 11.0  R: 63.5  A: 7.0      P: 11.9  R: 63.0  A: 7.5      P: 9.8   R: 56.0  A: 5.5
Jaro-Winkler         P: 11.0  R: 63.5  A: 7.0      P: 11.9  R: 63.0  A: 7.5      P: 9.8   R: 56.0  A: 5.5
Needleman-Wunsch     P: 43.7  R: 87.0  A: 38.0     P: 43.7  R: 87.0  A: 38.0     P: 16.9  R: 59.0  A: 10.0
LogSim               P: 43.7  R: 87.0  A: 38.0     P: 43.7  R: 87.0  A: 38.0     P: 42.0  R: 87.0  A: 36.5
Soundex              P: 35.9  R: 83.5  A: 30.0     P: 35.9  R: 83.5  A: 30.0     P: 28.7  R: 82.0  A: 23.5
Smith-Waterman       P: 17.3  R: 63.5  A: 11.0     P: 13.4  R: 59.5  A: 8.0      P: 10.4  R: 57.5  A: 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and, therefore, will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk.


This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim, also using a tokenizer to break the strings into sets of tokens, then taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and additionally adding an extra weight for strings that not only share at least one perfect match but also have another token sharing its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
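The following Java sketch is a direct reading of equations (20)–(22); it is a simplified illustration (tokenization is reduced to whitespace splitting, and the counting of first-letter matches is one possible interpretation), not the exact JustAsk implementation.

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of JFKDistance: common tokens plus a half-weighted credit for unmatched
// tokens whose first letters match across the two answers.
class JFKDistanceSketch {

    static double distance(String a1, String a2) {
        String[] t1 = a1.trim().split("\\s+");
        String[] t2 = a2.trim().split("\\s+");

        Set<String> s1 = new LinkedHashSet<>(Arrays.asList(t1));
        Set<String> s2 = new LinkedHashSet<>(Arrays.asList(t2));

        // commonTerms: tokens shared by both answers.
        Set<String> common = new LinkedHashSet<>(s1);
        common.retainAll(s2);

        // Tokens left unmatched on each side.
        Set<String> rest1 = new LinkedHashSet<>(s1);
        rest1.removeAll(common);
        Set<String> rest2 = new LinkedHashSet<>(s2);
        rest2.removeAll(common);

        // firstLetterMatches: unmatched tokens of a1 whose first letter matches
        // the first letter of some unmatched token of a2.
        int firstLetterMatches = 0;
        for (String u : rest1) {
            for (String v : rest2) {
                if (!u.isEmpty() && !v.isEmpty()
                        && Character.toLowerCase(u.charAt(0)) == Character.toLowerCase(v.charAt(0))) {
                    firstLetterMatches++;
                    break;
                }
            }
        }
        int nonCommonTerms = rest1.size() + rest2.size();

        double commonTermsScore = (double) common.size() / Math.max(t1.length, t2.length);  // eq. (21)
        double firstLetterMatchesScore =
                nonCommonTerms == 0 ? 0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;  // eq. (22)

        return 1.0 - commonTermsScore - firstLetterMatchesScore;  // eq. (20)
    }

    public static void main(String[] args) {
        // Roughly 0.42 under this reading; lower (better) than Levenshtein for the same pair.
        System.out.println(distance("J F Kennedy", "John Fitzgerald Kennedy"));
    }
}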

A few flaws in this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by answer type (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                   200 questions                     1210 questions
Before             P: 51.1%  R: 87.0%  A: 44.5%      P: 28.2%  R: 79.8%  A: 22.5%
After new metric   P: 51.7%  R: 87.0%  A: 45.0%      P: 29.2%  R: 79.8%  A: 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis much),


here are the results for every metric incorporated, for the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric / Category    Human    Location   Numeric
Levenshtein          50.00    53.85      4.84
Overlap              42.42    41.03      4.84
Jaccard index         7.58     0.00      0.00
Dice                  7.58     0.00      0.00
Jaro                  9.09     0.00      0.00
Jaro-Winkler          9.09     0.00      0.00
Needleman-Wunsch     42.42    46.15      4.84
LogSim               42.42    46.15      4.84
Soundex              42.42    46.15      3.23
Smith-Waterman        9.09     2.56      0.00
JFKDistance          54.55    53.85      4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single, fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers fail easily, due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases.


Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error is the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results through the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

ndash Develop a more semantic clusterer instead of merely lsquoortographicalrsquo byadding more rules that deal with the semantic of the answers

ndash Use the rules as a distance metric of its own

ndash Finally implementing solutions to the problems and questions unansweredover Section 54 can help improving somewhat these results
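One possible reading of the second item above: wrap the relation rules in the same interface the clusterer already uses for its string distances, returning a zero distance whenever a rule fires and deferring to a lexical metric otherwise. The Distance interface and the related() placeholder below are assumptions made for illustration; they are not the actual JustAsk classes.

    // Hypothetical interface mirroring the clusterer's existing distance metrics.
    interface Distance {
        double distance(String a1, String a2);
    }

    // Sketch: relations as a distance metric. If two answers are judged equivalent
    // (or one includes the other) by the rule set, treat them as being at distance 0
    // so the single-link clusterer groups them; otherwise defer to a lexical metric.
    class RuleBasedDistance implements Distance {

        private final Distance fallback;

        RuleBasedDistance(Distance fallback) {
            this.fallback = fallback;
        }

        // Placeholder for the appendix rules (equivalence/inclusion by category).
        private boolean related(String a1, String a2) {
            String x = a1.toLowerCase();
            String y = a2.toLowerCase();
            return x.equals(y) || x.contains(y) || y.contains(x);
        }

        public double distance(String a1, String a2) {
            return related(a1, a2) ? 0.0 : fallback.distance(a1, a2);
        }
    }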


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This appendix specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the jth token of candidate answer Ai

• Ai,j,k – the kth char of the jth token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is the hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

(A Java rendering of these predicates is sketched right after this list.)
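For reference, most of these predicates map directly onto simple string operations. A possible Java rendering is sketched below (class and method names are ours, not taken from the JustAsk code base); lemma(X) and ishypernym(X, Y) are left out, since they would be backed by a lexical resource such as WordNet.

    import java.util.regex.Pattern;

    final class AnswerPredicates {

        // length(X) -- the number of tokens of X
        static int length(String x) {
            return x.trim().split("\\s+").length;
        }

        // the jth token of X (0-based), i.e. X_j in the notation above
        static String token(String x, int j) {
            return x.trim().split("\\s+")[j];
        }

        // contains(X, Y) and containsic(X, Y)
        static boolean contains(String x, String y) {
            return x.contains(y);
        }
        static boolean containsIC(String x, String y) {
            return x.toLowerCase().contains(y.toLowerCase());
        }

        // equals(X, Y) and equalsic(X, Y)
        static boolean equalsIC(String x, String y) {
            return x.equalsIgnoreCase(y);
        }

        // matches(X, RegExp), plus the case-insensitive variant used by some rules
        static boolean matches(String x, String regExp) {
            return Pattern.matches(regExp, x);
        }
        static boolean matchesIC(String x, String regExp) {
            return Pattern.compile(regExp, Pattern.CASE_INSENSITIVE).matcher(x).matches();
        }
    }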

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001
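Read literally, these two clauses amount to a containment test plus a token-count difference of one or two. Continuing the AnswerPredicates sketch above (the method name is ours), a possible rendering is:

    // Sketch of the Date inclusion test: the less specific date (e.g. 'M01 Y2001')
    // includes the more specific one (e.g. 'D01 M01 Y2001') when the longer answer
    // contains the shorter one and has one or two extra tokens.
    static boolean dateIncludes(String ai, String aj) {
        int diff = length(aj) - length(ai);
        return contains(aj, ai) && (diff == 1 || diff == 2);
    }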


1.2 Questions of category Numeric

Given:

• number = (\d+([0-9,.E])*)

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  – Example: less than 300 includes 200
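The core of the Numeric rules is the ±10% window defined by max and min. A sketch of the equivalence test follows, again as methods for the same helper class (the number-extraction regex here is simplified, not the definition above):

    // Sketch: two Numeric answers are equivalent when their values fall within a
    // +/-10% window of each other (max = 1.1, min = 0.9), e.g. '210 people' vs
    // '200 people'.
    static final double MAX = 1.1;
    static final double MIN = 0.9;

    // Returns the first number found in the answer, or null if there is none.
    static Double numericValue(String answer) {
        java.util.regex.Matcher m =
                java.util.regex.Pattern.compile("\\d+(\\.\\d+)?").matcher(answer);
        return m.find() ? Double.valueOf(m.group()) : null;
    }

    static boolean numericEquivalent(String ai, String aj) {
        Double numAi = numericValue(ai);
        Double numAj = numericValue(aj);
        if (numAi == null || numAj == null) {
            return false;
        }
        return (numAi * MAX > numAj) && (numAi * MIN < numAj);
    }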

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to F Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)

• levenshteinDistance(Ai, Aj) < 1
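As an illustration, the first and third equivalence clauses of Section 1.3 can be rendered roughly as follows, on top of the token helpers sketched earlier (0-based token indices, as in the clauses):

    // Sketch: 'John F Kennedy' vs 'John Kennedy' (middle initial dropped) and
    // 'John Fitzgerald Kennedy' vs 'John F Kennedy' (middle tokens share their
    // first letter).
    static boolean humanEquivalent(String ai, String aj) {
        if (length(ai) == 3 && length(aj) == 2) {
            return equalsIC(token(ai, 0), token(aj, 0))
                    && equalsIC(token(ai, 2), token(aj, 1));
        }
        if (length(ai) == 3 && length(aj) == 3) {
            return equalsIC(token(ai, 0), token(aj, 0))
                    && equalsIC(token(ai, 2), token(aj, 2))
                    && Character.toLowerCase(token(ai, 1).charAt(0))
                            == Character.toLowerCase(token(aj, 1).charAt(0));
        }
        return false;
    }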

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  – Example: California includes East California
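The inclusion clauses above assume a tokenizer that splits commas into tokens of their own (hence length(Aj) = 3 for 'Miami, Florida'). A sketch under that assumption, with helper names of our choosing:

    // Splits off commas as separate tokens, as the clauses' token counts imply.
    static String[] tokenizeWithCommas(String s) {
        return s.trim().replace(",", " , ").split("\\s+");
    }

    // Sketch: a one-token location includes a 'City, State' style answer whose
    // last token equals it, e.g. 'Florida' includes 'Miami, Florida'.
    static boolean locationIncludes(String ai, String aj) {
        String[] ti = tokenizeWithCommas(ai);
        String[] tj = tokenizeWithCommas(aj);
        return ti.length == 1 && tj.length == 3
                && ti[0].equalsIgnoreCase(tj[2])
                && tj[1].equals(",");
    }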

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey
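The Entity rules depend on lemma(X) and ishypernym(X, Y), which in JustAsk's setting would be backed by WordNet. The sketch below keeps those two operations behind small interfaces instead of committing to a particular WordNet API; all names here are ours.

    // Abstractions for the WordNet-backed operations used by the Entity rules.
    interface Lemmatizer {
        String lemma(String expression);
    }
    interface HypernymChecker {
        boolean isHypernym(String x, String y);
    }

    final class EntityRules {

        private final Lemmatizer lemmatizer;
        private final HypernymChecker hypernyms;

        EntityRules(Lemmatizer lemmatizer, HypernymChecker hypernyms) {
            this.lemmatizer = lemmatizer;
            this.hypernyms = hypernyms;
        }

        private static int length(String s) {
            return s.trim().split("\\s+").length;
        }

        // Equivalence: short answers whose lemmas coincide.
        boolean equivalent(String ai, String aj) {
            return length(ai) <= 2 && length(aj) <= 2
                    && lemmatizer.lemma(ai).equalsIgnoreCase(lemmatizer.lemma(aj));
        }

        // Inclusion: e.g. 'animal' includes 'monkey'.
        boolean includes(String ai, String aj) {
            return length(ai) <= 2 && length(aj) <= 2 && hypernyms.isHypernym(ai, aj);
        }
    }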

323 TREC corpus

In an effort to use a more extended and lsquostaticrsquo corpus we decided to useadditional corpora to test JustAsk Therefore in addition to Google and Yahoopassages we now have also 20 years of New York Times editions from 1987 to2007 and in addition to the 200 test questions we found a different set at TRECwhich provided a set of 1768 questions from which we filtered to extract thequestions we knew the New York Times had a correct answer 1134 questionsremained after this filtering stage

TREC questions

The TREC data is composed by a set of questions and also by a set ofcorrectincorrect judgements from different sources according to each question(such as the New York Times online editions) ndash answers that TREC had judgedas correct or incorrect We therefore know if a source manages to answer cor-rectly to a question and what is the lsquocorrectrsquo answer it gives A small applicationwas created using the file with all the 1768 questions and the file containing allthe correct judgements as input parameters in order to filter only the questionswe know the New York Times managed to retrieve a correct answer

After experimenting with the 1134 questions we noticed that many of thesequestions are actually interconnected (some even losing an original referenceafter this filtering step) and therefore losing precious entity information byusing pronouns for example which could explain in part the underwhelmingresults we were obtaining with them Questions such as lsquoWhen did she diersquoor lsquoWho was his fatherrsquo remain clearly unanswered due to this lost referenceWe simply do not know for now who lsquohersquo or lsquoshersquo is It makes us questionldquoHow much can we improve something that has no clear references because thequestion itself is referencing the previous one and so forthrdquo

The first and most simple solution was to filter the questions that containedunknown third-party references ndash pronouns such as lsquohersquo lsquoshersquo lsquoitrsquo lsquoherrsquo lsquohimrsquolsquohersrsquo or lsquohisrsquo plus a couple of questions we easily detected as driving the systemto clear and obvious dead ends such as lsquoHow many are there nowrsquo A total of232 questions without these features were collected and then tested

But there were still much less obvious cases present Here is a blatant ex-ample of how we were still over the same problem

question 1129 QWhere was the festival heldquestion 1130 QWhat is the name of the artistic director of thefestivalquestion 1131 QWhen was the festival heldquestion 1132 QWhich actress appeared in two films shown at thefestivalquestion 1133 QWhich film won the Dramatic Screen Play Awardat the festivalquestion 1134 QWhich film won three awards at the festival

30

in this particular case the original reference lsquoSundance Festivalrsquo is actuallylost along with vital historic context (which Sundance festival Is it the 21stthe 25th ) Another blatant example is the use of other complex anaphorssuch as lsquothe companyrsquo instead of simply lsquoitrsquo in the question

We knew that this corpus was grouped by topics and therefore was composedby a set of questions for each topic By noticing that each of these questionsfrom the TREC corpus possessed a topic information that many times wasnot explicit within the question itself the best possible solution should be tosomehow pass this information to the query at the Passage Retrieval moduleBy manually inserting this topic information into the questions from TREC2004 to 2007 (the 2003 questions were independent from each other) in the bestmanner and by noticing a few mistakes in the application to filter the questionsthat had a correct answer from New York Times we reached a total of 1309questions However due to a length barrier at JustAsk we reduced it to 1210questions

New York Times corpus

20 years of New York Times editions were indexed using Lucene Search astool Instead of sentences or paragraphs as relevant passages as it happpenedwith Google and Yahoo search methods we started to return the whole articleas a passage Results were far from being as famous as with the Google corpus

The problem here might lie in a clear lack of redundancy ndash candidate answersnot appearing as often in several variations as in Google

We also questioned if the big problem is within the passage retrieval phasesince we are now returning a whole document instead of a single paragraphphraseas we were with the Google corpus Imagine the system retrieving a documenton the question lsquoWhat is the capital of Iraqrsquo that happens to be a politicaltext mentioning the conflict between the country and the United States andends up mentioning Washington more times over the document than any Iraqcity including the countryrsquos capital

With this in mind we tried another approach in which the indexed passageswould not be the whole documents but only the paragraph that contains thekeywords we send in the query This change improved the results but unfortu-nately not as much as we could hope for as we will see over the next section

33 Baseline

331 Results with Multisix Corpus

Results for the overall system before this work are described over Table 1Precision1 indicates the precision results for the top answer returned (out ofthe three possible final answers JustAsk returns) As we can see in 200 ques-tions JustAsk attempted to answer 175 questions 503 of the times correctlyMoreover looking at the results of both Precision and Precision1 we knowthat JustAsk managed to push the correct answer to the top approximately704 of the time

31

Questions Correct Wrong No Answer200 88 87 25

Accuracy Precision Recall MRR Precision144 503 875 037 354

Table 1 JustAsk baseline best results for the corpus test of 200 questions

In Tables 2 and 3 we show the results of the Answer Extraction module usingLevenshtein distance with a threshold of 033 to calculate the distance betweencandidate answers For the different amounts of retrieved passages (8 32 ou 64passages) from Google we present how successful were the Candidate AnswerExtraction (CAE) and Answer Selection (AS) stages The success of CAE stageis measured by the number of questions that manage to extract at least onecorrect answer while the success of the AS stage is given by the number ofquestions that manage to extract one final answer from the list of candidateanswers They also present us a distribution of the number of questions with acertain amount of extracted answers

Google ndash 8 passages Google ndash 32 passages Extracted Questions CAE AS Questions CAE AS

Answers () success success () success success0 21 0 ndash 19 0 ndash1 to 10 73 62 53 16 11 1111 to 20 32 29 22 13 9 721 to 40 11 10 8 61 57 3941 to 60 1 1 1 30 25 1761 to 100 0 ndash ndash 18 14 11gt 100 0 ndash ndash 5 4 3All 138 102 84 162 120 88

Table 2 Answer extraction intrinsic evaluation using Google and two differentamounts of retrieved passages

We can see that this module achieves the best results with 64 retrieved pas-sages If we reduce the number of passages the number of extracted candidatesper question also diminishes But if the number of passages is higher we alsorisk losing precision having way too many wrong answers in the mix Moreoverthe AS and CAE stages are 100 successful with 11 to 20 answers extractedWhen that number increases the accuracies do not show any concrete tendencyto increase or decrease

In Table 4 we show the evaluation results of the answer extraction compo-nent of JustAsk according to the question category The column lsquoTotalrsquo givesus the total number of questions per category with lsquoQrsquo being the number ofquestions for which at least one candidate answer was extracted The successof the CAE and AS stages is defined in the same way as in the previous tables

With the category Entity the system only extracts one final answer no

32

Google ndash 64 passages Extracted Questions CAE AS

Answers () success success0 17 0 01 to 10 18 9 911 to 20 2 2 221 to 40 16 12 841 to 60 42 38 2561 to 100 38 32 18gt 100 35 30 17All 168 123 79

Table 3 Answer extraction intrinsic evaluation using 64 passages from thesearch engine Google

Google ndash 8 passages Google ndash 32 passages Google ndash 64 passagesTotal Q CAE AS Q CAE AS Q CAE AS

abb 4 3 1 1 3 1 1 4 1 1des 9 6 4 4 6 4 4 6 4 4ent 21 15 1 1 16 1 0 17 1 0hum 64 46 42 34 55 48 36 56 48 31loc 39 27 24 19 34 27 22 35 29 21num 63 41 30 25 48 35 25 50 40 22

200 138 102 84 162 120 88 168 123 79

Table 4 Answer extraction intrinsic evaluation for different amounts of re-trieved passages according to the question category

matter how many passages are used We found out that this category achievesthe best result with only 8 passages retrieved from Google Using 64 passagesincreases the number of questions but at the same time gives worse resultsfor the AS section in three categories (Human Location and Numeric) com-pared with the results obtained with 32 passages This table also evidences thediscrepancies between the different stages for all of these categories showingthat more effort should be put at a certain stage for a certain category ndash suchas giving more focus to the Answer Selection with the Numeric category forinstance

332 Results with TREC Corpus

For the set of 1134 questions the system answered correctly 234 questions (fora 206 accuracy) using Levenshtein metric with a threshold of 017 and with20 paragraphs retrieved from New York Times which represents an increaseof approximately 322 (and more 57 correct answers than we had) over theprevious approach of using the whole documents as passages This paragraphindexing method proved to be much better overall than document indexing and

33

was therefore used permantly while searching for even better resultsThese 1210 questions proved at least to have a higher accuracy and recall

scores than the previous sets of questions as we can see from the table 5 below

1134 questions 232 questions 1210 questionsPrecision 267 282 274Recall 772 672 769Accuracy 206 190 211

Table 5 Baseline results for the three sets of corpus that were created usingTREC questions and New York Times corpus

Later we tried to make further experiences with this corpus by increasingthe number of passages to retrieve We found out that this change actually hada considerable impact over the system extrinsic results as Table 6 shows

20 30 50 100Accuracy 211 224 237 238

Table 6 Accuracy results after the implementation of the features described inSections 4 and 5 for 20 30 50 and 100 passages extracted from the NYT corpus

Even though the 100 passages offer the best results it also takes more thanthrice the time than running just 20 with a 27 difference in accuracy Sodepending on whether we prefer time over accuracy any of these is viable withthe 30 passagens offering the best middle ground here

34 Error Analysis over the Answer Module

This section was designed to report some of the most common mistakes thatJustAsk still presents over its Answer Module ndash mainly while clustering theanswer candidates failing to cluster equal or inclusive answers into the samecluster but also at the answer extraction phase These mistakes are directlylinked to the distance algorithms we have at the moment (Levenshtein andOverlap metrics)

1 Wrong candidates extracted ndash for the question lsquoHow long is the Oka-vango Riverrsquo for instance there were two answers of the Numeric typeextracted lsquo1000 milesrsquo and lsquo300 kmrsquo with the former being the correctanswer Unfortunately since there are only two instances of these answersthey both share the same score of 10 Further analysis of the passagescontaining the respective answers reveals that correctness

ndash lsquoThe Okavango River rises in the Angolan highlands and flowsover 1000 milesrsquo

34

ndash lsquoNamibia has built a water canal measuring about 300 kmlong Siyabona Africa Travel (Pty) Ltd rdquoPopa Falls mdash Oka-vango River mdash Botswanardquo webpage rsquo

As we can see the second passage also includes the headword lsquoOkavangoRiverrsquo so we can not simply boost the score of the passage that includesthe questionrsquos headword

Possible solution but we can try to give more weight to a candidateanswer if that answer is closer to the keywords used in the search by acertain predefined threshold

2 Parts of person namesproper nouns ndash several candidates over categoriessuch as Person and Location remain in different clusters when theyclearly refer the same entity Here are a few examples

lsquoQueen Elizabethrsquo vs lsquoElizabeth IIrsquo

lsquoMt Kilimanjarorsquo vs lsquoMount Kilimanjarorsquo

lsquoRiver Glommarsquo vs rsquoGlommarsquo

lsquoConsejo Regulador de Utiel-Requenarsquo vs lsquoUtil-Requenarsquo vs lsquoRe-quenarsquo

lsquoCalvin Broadusrsquo vs lsquoBroadusrsquo

Solution we might extend a few set of hand-written rules to relate an-swers such as lsquoQueen Elizabethrsquo and lsquoElizabeth IIrsquo or lsquoBill Clintonrsquo andlsquoMr Clintonrsquo normalizing abbreviations such as lsquoMrrsquo lsquoMrsrsquo lsquoQueenrsquoetc Then by applying a new metric that performs similarity on a to-ken level we might be able to cluster these answers If there is at leastone match between two tokens from candidate answers then it is quitelikely that the two mention the same entity specially if both candidatesshare the same number of tokens and the innitials of the remaining tokensmatch To give a simple example consider the answers lsquoJ F Kennedyrsquoand lsquoJohn Fitzgerald Kennedyrsquo A simple Levenshtein distance will de-tect a considerable number of edit operations Should there be a bettermetric for these particular cases LogSim distance takes into account thenumber of links ie common tokens between two answers which shouldgive it a slight edge to Levenshtein It would if we had more than onlyone common link as we have here (lsquoKennedyrsquo) out of three words for eachanswer So we have LogSim approximately equal to 0014 Levenshteinactually gives us a better score (059)With the new proposed metric we detect the common token lsquoKennedyrsquoand then we check the other tokens In this case we have an easy matchwith both answers sharing the same number of words and each wordsharing at least the same innitial in the same order What should be thethreshold here in case we do not have this perfect alignement If in an-other example we have a single match between answers but the innitialsof the other tokens are different should we reject it The answer is likelyto be a lsquoYesrsquo

35

3 Placesdatesnumbers that differ in specificity ndash some candidates over thecategories Location and numeric that might refer to a correct answerremain in separate clusters In many cases these are similar mistakes asthe ones described in 2 (ie one location composes part of the name ofthe other) but instead of refering the same entity these answers share arelation of inclusion Examples

lsquoMt Kenyarsquo vs lsquoMount Kenya National Parkrsquo

lsquoBavariarsquo vs lsquoOverbayern Upper Bavariarsquo

lsquo2009rsquo vs lsquoApril 3 2009rsquo

Moreover for the question lsquoWhere was the director Michelangelo Anto-nioni bornrsquo the answers lsquoFerrararsquo and lsquoItalyrsquo are found with equal scores(30) In this particular case we can consider either of these answers ascorrect with lsquoFerrararsquo included in Italy (being a more specific answer)But a better answer to be extracted would probably be lsquoFerrara ItalyrsquoThe questions here are mainly ldquoWhere do we draw a limit for inclusiveanswers Which answer should we consider as representative if we groupthese answers in a single clusterrdquo

Solution based on [26] we decided to take into account relations of in-clusion and we are using the same rules that were used over the TRECcorpus now with JustAsk hoping for similar improvements

4 Language issue ndash many entities share completely different names for thesame entity we can find the name in the original language of the entityrsquoslocation or in a normalized English form And even small names such aslsquoGlommarsquo and lsquoGlamarsquo remain in separate clusters

Solution Gazeteer information can probably help here Specific metricssuch as Soundex can also make a positive impact even if in very particularcases such as the lsquoGlommarsquo vs lsquoGlama example

5 Different measuresConversion issues ndash such as having answers in milesvs km vs feet pounds vs kilograms etc

Solution create a normalizer that can act as a converter for the mostimportant measures (distance weight etc)

6 MisspellingsFormatting issues ndash many namesentities appear either mis-spelled or in different formats (capital letters vs lower cases) over theweb While Levenshtein distance may handle most of these cases for theminimum threshold tested there are still a few unresolved such as

lsquoKotashaanrsquo vs lsquoKhotashanrsquo

lsquoGroznyrsquo and lsquoGROZNYrsquo

Solution the solution should simply be normalizing all the answers intoa standard type As for the misspelling issue a similarity measure such

36

as Soundex can solve more specific cases such as the one described aboveThe question remains when and where do we apply it exactly

7 Acronyms ndash seen in a flagrant example with the answers lsquoUSrsquo and lsquoUSrsquounclustered but most generally pairs such as lsquoUnited States of Americarsquoand lsquoUSArsquo

Solution In this case we may extend abbreviations normalization to in-clude this abbreviation

Over this section we have looked at several points of improvement Next wewill describe how we solved all of these points

37

4 Relating Answers

Over this section we will try to solve the potential problems detected duringthe Error Analysis (see Section 34) using for that effort the implementation ofthe set of rules that will compose our lsquorelations modulersquo new normalizers anda new proximity measure

41 Improving the Answer Extraction moduleRelations Module Implementation

Figure 2 represents what we want for our system to detect the relationsdescribed in the Section 21

Figure 4 Proposed approach to automate the detection of relations betweenanswers

We receive several candidate answers of a question as input and then throughwhat we can call a lsquoRelationship Detection Modulersquo relationships between pairsof answers are assigned This module will contain the metrics already available(Levenshtein and word overlap) plus the new metrics taken from the ideas col-lected over the related work We also may want to assign different weights todifferent relationships to value relationships of equivalence over inclusion al-ternative and specially contradictory answers which will possibly allow a betterclustering of answers

The goal here is to extend the metrics already existent to more efficientsemantically-based ones using notions not only of synonymy but also hiponymyand antonymy that we can access through WordNet Moreover we may want to

38

analyze how much can the context that surronds an extracted answer influencethe overall result taking into account what we saw in [2] We also plan to usethe set of rules provided in [35] to enhance the final answer returned to the user

After we have this process automatic and we have extended the answeringmodule we can perform a study on how relationships between answers canaffect the outcome of the answer and therefore the efficiency of the AnswerExtraction module

As it was described in Section 21 we define four types of relations betweenanswers Equivalence Inclusion Alternative and Contradiction For the con-text of aligning candidate answers that should share a same cluster our firstthought was that the equivalence relation had a much bigger weight than allthe others But inclusions should also matter according to [26] Moreover theymight even be more benefic than those of equivalence for a certain category

The work in [26] has taught us that using relations between answers canimprove considerably our results In this step we incorporated the rules used inthe context of that paper to the JustAsk system and created what we proposedto do during the first part of this thesis ndash a module in JustAsk to detect relationsbetween answers and from then use that for a (potentially) better clusteringof answers by boosting the score of candidates sharing certain relations Thiswould also solve point 3 described over section 34 (lsquoPlacesdatesnumbers thatdiffer in specificityrsquo)

These rules cover the following categories Entity Numeric Locationand Human Individual They can be seen as a set of clauses determiningwhich relation we should assign to a pair of candidate answers

The answers now possess a set of relations they share with other answersndash namely the type of relation and the answer they are related to and a scorefor that relation The module itself is composed of an interface (Scorer) whichsupports all the rules for the three types of relations we are interested in counting(and associating to the respective answers) for a better clustering

ndash Frequency ndash ie the number of times the exact same answer appearswithout any variation

ndash Equivalence

ndash Inclusion (divided in two types one for the answer that includes and theother for the answer that is included by the other)

For a first experiment we decided to give a score of 05 to each relation ofinclusion found for each answer and 10 to frequency and equivalence Laterwe decided to test the assessment made in [26] that relations of equivalence areactually more lsquopenalizingrsquo than those of inclusion for the category Entity andgive a score of 05 to relations of equivalence and 10 to inclusions for the ruleswithin that category Results were not different at all for all the corpora

39

Figure 5 Relationsrsquo module basic architecture Each answer has now a fieldrepresenting the set of relations it shares with other answers The interfaceScorer made to boost the score of the answers sharing a particular relation hastwo components Baseline Scorer to count the frequency and Rules Scorer tocount relations of equivalence and inclusion

40

42 Additional Features

421 Normalizers

To solve the conversion abbreviation and acronym issues described in points2 5 and 7 of Section 34 a few normalizers were developed For the con-version aspect a specific normalizer class was created dealing with answersof the type NumericDistance NumericSize NumericWeight and Nu-merictemperature that is able to convert kilometres to miles meters to feetkilograms to pounds and Celsius degrees to Fahrenheit respectively

Another normalizer was created to remove several common abbreviaturesfrom the candidate answers hoping that (with the help of new measures) wecould improve clustering too The abbreviatures removed were lsquoMrrsquo lsquoMrsrsquorsquo Mslsquo lsquoKingrsquo lsquoQueenrsquo lsquoDamersquo lsquoSirrsquo and lsquoDrrsquo Over here we find a partic-ular issue worthy of noting we could be removing something we should notbe removing Consider the possible answers lsquoMr Trsquo or lsquoQueenrsquo (the band)Curiously an extended corpus of 1134 questions provided us even other exam-ple we would probably be missing lsquoMr Kingrsquo ndash in this case however we caneasily limit King removals only to those occurences of the word lsquoKingrsquo right atthe beggining of the answer We can also add exceptions for the cases aboveStill it remains a curious statement that we must not underestimate under anycircumstances the richness of a natural language

A normalizer was also made to solve the formatting issue of point 6 to treatlsquoGroznyrsquo and lsquoGROZNYrsquo as equal answers

For the point 7 a generic acronym normalizer was used converting acronymssuch as lsquoUSrsquo to lsquoUnited Statesrsquo or lsquoUNrsquo to lsquoUnited Nationsrsquo ndash for categoriessuch as LocationCountry and HumanGroup As with the case of the nor-malizers above it did not produce any noticeable changes in the system resultsbut it also evaluates very particular cases for the moment allowing room forfurther extensions

422 Proximity Measure

As with the problem in 1 we decided to implement the solution of giving extraweight to answers that remain closer to the query keywords that appear throughthe passages For that we calculate the average distance from the answer tothe several keywords and if that average distance is under a certain threshold(we defined it as being 20 characters long) we boost that answerrsquos score (by 5points)

41

43 Results

200 questions 1210 questionsBefore features P506 R870

A440P280 R795A222

After features A511 R870A445

P282 R798A225

Table 7 Precision (P) Recall (R)and Accuracy (A) Results before and after theimplementation of the features described in 41 42 and 43 Using 30 passagesfrom New York Times for the corpus of 1210 questions from TREC

44 Results Analysis

These preliminary results remain underwhelming taking into account theeffort and hope that was put into the relations module (Section 41) in particular

The variation in results is minimal ndash only one to three correct answersdepending on the corpus test (Multisix or TREC)

45 Building an Evaluation Corpus

To evaluate the lsquorelation modulersquo described over 41 a corpus for evaluatingrelations between answers was designed before dealing with the implementationof the module itself For that effort we took the references we gathered in theprevious corpus and for each pair of references assign one of the possible fourtypes of relation they share (Equivalence Inclusion Contradiction or Alterna-tive) A few of these relations were not entirely clear and in that case a fifthoption (lsquoIn Doubtrsquo) was created This corpus was eventually not used but itremains here for future work purposes

As it happened with the lsquoanswers corpusrsquo in order to create this corporacontaining the relationships between the answers more efficiently a new appli-cation was made It receives the questions and references file as a single inputand then goes to iterate on the references on pairs asking the user to chooseone of the following options corresponding to the relations mentioned in Section21

ndash lsquoErsquo for a relation of equivalence between pairs

ndash lsquoIrsquo if it is a relation of inclusion (this was later expanded to lsquoI1rsquo if thefirst answer includes the second and lsquoI2rsquo if the second answer includes thefirst)

ndash lsquoArsquo for alternative (complementary) answers

ndash lsquoCrsquo for contradictory answers

42

ndash lsquoDrsquo for causes of doubt between relation types

If the command given by the input fails to match any of these options thescript asks the user to repeat the judgment If the question only has a singlereference it just advances to the next question If the user running the scriptwants to skip a pair of references heshe can type lsquoskiprsquo At the end all therelations are printed to an output file too Each line has 1) the identifier of thequestion 2) the type of relationship between the two answers (INCLUSIONEQUIVALENCE COMPLEMENTARY ANSWERS CONTRADITORY AN-SWERS or IN DOUBT) and 3) the pair of answers evaluated For examplewe have

QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee

In other words for the 5th question (lsquoWho is the president of South Ko-rearsquo) there is a relationship of equivalence between the answersreferences lsquoLeeMyung-bakrsquo and lsquoLeersquo

Figure 6 Execution of the application to create relationsrsquo corpora in the com-mand line

Over the next chapter we will try to improve the results working directlywith several possible distance measures (and combinations of two or more ofthese) we can apply during the clustering phase

43

5 Clustering Answers

As we concluded in Section 44 results were still not improving as much as onecan hope for But we still had new clustering metrics to potentially improveour system as we proposed also doing Therefore we decided to try new ex-periments by bringing new similarity metrics to the clustering stage The mainmotivation was to find a new measure (or a combination of measures) that couldbring better results than Levenshtein did for the baseline

51 Testing Different Similarity Metrics

511 Clusterer Metrics

Our practical work here began with experimenting different similarity metricsintegrating them into the distances package which contained the Overlap andLevenshtein distances For that several lexical metrics that were used for a Nat-ural Language project in 20102011 at Tagus Park were added These metricswere taken from a Java similarity package (library) ndash lsquoukacshefwitsimmetricssimilaritymetricsrsquo

ndash Jaccard Index [11] ndash the Jaccard index (also known as Jaccard similaritycoefficient) is defined by the size of the intersection divided by the size ofthe union between two sets (A and B)

J(AB) =|A capB||A cupB|

(15)

In other words the idea will be to divide the number of characters incommon by the sum of the total number of characters of the two sequences

ndash Dicersquos Coefficient [6] ndash related to Jaccard but the similarity is now com-puted using the double of the number of characters that the two sequencesshare in common With X and Y as two sets of keywords Dicersquos coefficientis given by the following formula

Dice(AB) =2|A capB||A|+ |B|

(16)

Dicersquos Coefficient will basically give the double of weight to the intersectingelements doubling the final score Consider the following two stringslsquoKennedyrsquo and lsquoKeneddyrsquo When taken as a string similarity measure thecoefficient can be calculated using the sets of bigrams for each of the twostrings [15] The set of bigrams for each word is Ke en nn ne ed dyand Ke en ne ed dd dy respectively Each set has six elements andthe intersection of these two sets gives us exactly four elements Ke enne dy The result would be (24)(6+6) = 812 = 23 If we were usingJaccard we would have 412 = 13

44

ndash Jaro [13 12] ndash this distance is mainly used in the record linkage (duplicatedetection) area It is calculated by the following

Jaro(s1 s2) =1

3(m

|s1|+

m

|s2|+mminus tm

) (17)

in which s1 and s2 are the two sequences to be tested |s1| and |s2| aretheir lengths m is the number of matching characters in s1 and s2 and tis the number of transpositions needed to put the matching characters inthe same order

ndash Jaro-Winkler [45] ndash is an extension of the Jaro metric This distance isgiven by

dw = dj + (lp(1minus dj)) (18)

with dj being the Jaro distance between the two sequences l the length ofthe common prefix and p a constant scaling factor (the standard value forthis constant in Winklerrsquos work is p = 01) Jaro-Winkler therefore putsmore weight into sharing a common prefix compared to the original Jarodistance In other words it will give a much better score to strings suchas lsquoArnoldrsquo vs lsquoArniersquo or misspellings such as lsquoMartharsquo vs lsquoMarhtarsquo thanJaro

ndash Needleman-Wunsch [30] ndash dynamic programming algorithm similar toLevenshtein distance but adding a variable cost G (instead of the constant1 for Levenshtein)

D(i j) =

D(iminus 1 j minus 1) +GD(iminus 1 j) +GD(i j minus 1) +G

(19)

This package contained other metrics of particular interest to test suchas the following two measures which were also added and tested

ndash Soundex [36] ndash With this algorithm more focused to match words whichsound alike but differ lexically we can expect to match misspelled answerswith a higher Levenshtein distance than the threshold proposed (such aslsquoMassachussetsrsquo vs lsquoMachassucetsrsquo) or even match answers in differentlanguages that happen to sound alike but spelled differently

ndash Smith-Waterman [40] ndash is a variation of Needleman-Wunsch commonlyused in bioinformatics to align protein or nucleotide sequences It com-pares segments of all possible lengths (local alignments) instead of lookingat the sequence in its entirety and chooses the one that maximises thesimilarity measure leading us to believe that it might work better withanswers with common affixes such as lsquoactingrsquo and lsquoenactrsquo

45

Then the LogSim metric by Cordeiro and his team described over theRelated Work in Section 322 was implemented and added to the packageusing the same weighting penalty of 27 as in the teamrsquos experiments We thentested all the similarity measures using three different thresholds values 12(05) 13(033) and 16(017) of the single-link clustering algorithm in orderto evaluate how much does it affect the performance of each metric

512 Results

Results are described in Table 8 We decided to evaluate these metrics usingthe corpus of 200 questions for a simple reason it is by far the corpus thatprovides us the best results of the system at this point and we should expectproporcional results between these two corpus either way

513 Results Analysis

As we can see the metric that achieves the best results is still Levenshteindistance with the LogSim distance at a close second place overall ndash even holdingthe distintion of being the best measure for the highest threshold evaluated (05)Defining the lsquobestrsquo threshold was not as clear as picking the best metric giventhese results The threshold of 017 provides the best results for more than halfof the metrics here but there are cases when it is only the second best even ifthe difference itself is virtually inexistent in most cases

46

MetricThreshold 017 033 05Levenshtein P511

R870A445

P437R875A385

P373R845A315

Overlap P379R870A330

P330R870A330

P364R865A315

Jaccard index P98R560A55

P99R555A55

P99R555A55

Dice Coefficient P98R560A55

P99R560A55

P99R555A55

Jaro P110R635A70

P119R630A75

P98R560A55

Jaro-Winkler P110R635A70

P119R630A75

P98R560A55

Needleman-Wunch P437R870A380

P437R870A380

P169R590A100

LogSim P437R870A380

P437R870A380

A420R870A365

Soundex P359R835A300

P359R835A300

P287R820A235

Smith-Waterman P173R635A110

P134R595A80

P104R575A60

Table 8 Precision (P) Recall (R) and Accuracy (A) results for every similaritymetric added for a given threshold value using the test corpus of 200 questionsand 32 passages from Google Best precision and accuracy of the system bolded

52 New Distance Metric - rsquoJFKDistancersquo

521 Description

Considering the two candidate answers lsquoJF Kennedyrsquo and lsquoJohn FitzgeraldKennedyrsquo Levenshtein will give us a very high distance and therefore will notgroup these two answers into a single cluster The goal here is to finally beable to relate answers such as these A new measure was therefore created in

47

order to improve the overall accuracy of JustAsk This measure is in a waya progression from metrics such as Overlap Distance and LogSim also using atokenizer to break the strings into a set of tokens and then taking the basicconcept of common terms (we can also call it links words or tokens) betweenthe two strings and also adding an extra weight to strings that not only shareat least a perfect match but have also another token that shares the first letterwith the other token from the other answer To exemplify consider the exampleabove We have a perfect match in the common term Kennedy and then wehave two tokens unmatched Even the apparently intuitive Overlap distancewill give us a value of 23 making it impossible to cluster these two answersLevenshtein will be closer to 05 Still underwhelming Since we already haveone word in common with the new metric we can assume with some securitythat there is a relation between lsquoJohnrsquo and lsquoJrsquo and lsquoFitzgeraldrsquo and lsquoFrsquo Thisweight will obviously not be as big as the score we give to a lsquoperfectrsquo match butit should help us clustering more of these types of answers For that matter wedivide by 2 what would be a value identical to a match between two tokens forthat extra weight

The new distance is computed the following way

distance = 1minus commonTermsScoreminus firstLetterMatchesScore (20)

in which the scores for the common terms and matches of the first letter ofeach token are given by

commonTermsScore =commonTerms

max(length(a1) length(a2)) (21)

and

firstLetterMatchesScore =firstLetterMatchesnonCommonTerms

2(22)

with a1 and a2 being the two answers converted to strings after normaliza-tion and nonCommonTerms being the terms that remain unmatched such aslsquoJrsquo lsquoJohnrsquo lsquoFrsquo andlsquoFitzgeraldrsquo in the above example

A few flaws about this measure were thought later on namely the scopeof this measure whithin our topology of answers if for example we applythis same measure to answers of type Numeric results could become quiteunpredictable and wrong It became quite clear that a separation of measuresgiven a type of answer (Numeric Entity Location etc) was needed if wewanted to use this measure to its full potential as we will see

48

522 Results

Table 9 shows us the results after the implementation of this new distancemetric

200 questions 1210 questionsBefore P511 R870

A445P282 R798A225

After new metric P517 R870A450

P292 R798A233

Table 9 Precision (P) Recall (R) and Accuracy (A) Results after the imple-mentation of the new distance metric Using 30 passages from New York Timesfor the corpus of 1210 questions from TREC

523 Results Analysis

As we can see for these results even with the redundancy of sharing featuresthat we also have in the rules and in certain normalizers this metric can actu-ally improve our results even if the difference is minimal and not statisticallysignificant

53 Possible new features ndash Combination of distance met-rics to enhance overall results

We also thought in analyzing these results by the question category using anevaluation framework developed earlier The motivation behind this evaluationwas to build a decision tree of sorts that would basically choose a metric (or aconjunction of metrics) for a certain category (or a group of categories) If forinstance we detect that the Levenshtein distance gives us the best answers forthe category Human but for the category Location LogSim was the metricwhich offered us the best results we would have something like

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

()

For that effort we used the best thresholds for each metric using the corpusof 200 questions and Google search engine retrieving 32 passages Our questionclassifier achieves an accuracy of 90 [39] and there have been a few answersnot labeled as correct by the framework that are indeed correct With that inmind and considering that these are issues that are common to every singledistance metric used (so they should not affect that much the analysis we are

49

looking for) here are the results for every metric incorporated by the threemain categories Human Location and Numeric that can be easily evaluated(categories such as Description by retrieving the Wikipedia abstract are justmarked as wrongnil given the reference input) shown in Table 10

MetricCategory Human Location NumericLevenshtein 5000 5385 484Overlap 4242 4103 484Jaccard index 758 000 000Dice 758 000 000Jaro 909 000 000Jaro-Winkler 909 000 000Needleman-Wunch 4242 4615 484LogSim 4242 4615 484Soundex 4242 4615 323Smith-Waterman 909 256 000JFKDistance 5455 5385 484

Table 10 Accuracy results for every similarity metric added for each categoryusing the test corpus of 200 questions and 32 passages from Google Bestprecision and accuracy of the system bolded except for when all the metricsgive the same result

We can easily notice that all of these metrics behave in similar ways througheach category ie results are always lsquoproporcionalrsquo in each category to howthey perform overall That stops us from picking best metrics for each categorysince we find out that the new distance implemented equals or surpasses (aswith the case of Human category) all the others

54 Posterior analysis

In another effort to improve the results above a deeper analysis was madethrough the systemrsquos output regarding the corpus test of 200 questions in orderto detect certain lsquoerror patternsrsquo that could be resolved in the future Below arethe main points of improvement

– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).

– Numeric answers fail easily, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type NumericDate, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve a) ignoring 'one' to 'three' when written in words, b) somehow excluding the publishing date from the snippet, and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error detected lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as this, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

ndash We developed new metrics to improve the clustering of answers

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

ndash We performed further analysis to detect persistent error patterns

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology / North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, companion volume (HLT-NAACL 2003 short papers), Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Transactions on Information Systems, 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence, pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

Appendix

Appendix A – Rules for detecting relations between answers

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming two candidate answers Ai and Aj, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th character of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is a hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001

1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  – Example: less than 300 includes 200
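As a rough illustration of how the tolerance-based clauses above can be checked in code, the following Java sketch applies the max/min factors of 1.1 and 0.9 defined above; the class, the Relation enum and the parseNumber helper are assumptions made for this example, not the actual implementation.

// Illustrative check for the Numeric rules above (not the actual JustAsk code).
public final class NumericRules {

    public enum Relation { EQUIVALENCE, INCLUSION, NONE }

    private static final double MAX = 1.1;  // upper tolerance factor
    private static final double MIN = 0.9;  // lower tolerance factor

    // Returns the relation holding from Ai to Aj, assuming each answer holds one number.
    public static Relation relate(String ai, String aj) {
        double numAi = parseNumber(ai);
        double numAj = parseNumber(aj);

        boolean aiApprox = ai.toLowerCase().matches(".*\\b(approx[a-z]*|around|roughly|about|some)\\b.*");
        boolean aiMoreThan = ai.toLowerCase().matches(".*\\b(more than|over|at least)\\b.*");
        boolean aiLessThan = ai.toLowerCase().matches(".*\\b(less than|almost|up to|nearly)\\b.*");

        if (aiApprox && numAj <= numAi * MAX && numAj >= numAi * MIN) {
            return Relation.INCLUSION;      // e.g. "approximately 200 people" includes "210"
        }
        if (aiMoreThan && numAi < numAj) {
            return Relation.INCLUSION;      // e.g. "more than 200" includes "300"
        }
        if (aiLessThan && numAi > numAj) {
            return Relation.INCLUSION;      // e.g. "less than 300" includes "200"
        }
        if (numAi * MAX >= numAj && numAi * MIN <= numAj) {
            return Relation.EQUIVALENCE;    // e.g. "210 people" is equivalent to "200 people"
        }
        return Relation.NONE;
    }

    // Hypothetical helper: extracts the first number found in the answer string.
    private static double parseNumber(String s) {
        java.util.regex.Matcher m = java.util.regex.Pattern.compile("\\d+(\\.\\d+)?").matcher(s);
        return m.find() ? Double.parseDouble(m.group()) : Double.NaN;
    }
}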

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to F Kennedy

1.3.1 Questions of category Human Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)

• levenshteinDistance(Ai, Aj) < 1
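The Human clauses above translate almost directly into token comparisons; the Java sketch below is an illustrative reading of three of them (the class name, the title handling and the tokenization are assumptions made for this example, not the thesis code).

// Illustrative check for two Human-category candidate answers (assumed helper, not JustAsk code).
public final class HumanRules {

    private static final String TITLES = "(Mr|Mrs|Ms|Dr|Mister|Miss|Doctor|Lady|Madame)\\.?";

    public static boolean equivalent(String a1, String a2) {
        String[] ai = a1.trim().split("\\s+");
        String[] aj = a2.trim().split("\\s+");

        // 'John F Kennedy' vs 'John Kennedy': first and last tokens match.
        if (ai.length == 3 && aj.length == 2
                && ai[0].equalsIgnoreCase(aj[0])
                && ai[2].equalsIgnoreCase(aj[1])) {
            return true;
        }
        // 'Mr Kennedy' vs 'John Kennedy': title followed by a matching surname.
        if (ai.length == 2 && ai[0].matches(TITLES)
                && ai[1].equalsIgnoreCase(aj[aj.length - 1])) {
            return true;
        }
        // 'John Fitzgerald Kennedy' vs 'John F Kennedy': middle tokens share the initial.
        if (ai.length == 3 && aj.length == 3
                && ai[0].equalsIgnoreCase(aj[0])
                && ai[2].equalsIgnoreCase(aj[2])
                && Character.toLowerCase(ai[1].charAt(0)) == Character.toLowerCase(aj[1].charAt(0))) {
            return true;
        }
        return false;
    }
}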

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey

• Introduction
• Related Work
  • Relating Answers
  • Paraphrase Identification
    • SimFinder – A methodology based on linguistic analysis of comparable corpora
    • Methodologies purely based on lexical similarity
    • Canonicalization as complement to lexical features
    • Use of context to extract similar phrases
    • Use of dissimilarities and semantics to provide better similarity measures
  • Overview
• JustAsk architecture and work points
  • JustAsk
    • Question Interpretation
    • Passage Retrieval
    • Answer Extraction
  • Experimental Setup
    • Evaluation measures
    • Multisix corpus
    • TREC corpus
  • Baseline
    • Results with Multisix Corpus
    • Results with TREC Corpus
  • Error Analysis over the Answer Module
• Relating Answers
  • Improving the Answer Extraction module: Relations Module Implementation
  • Additional Features
    • Normalizers
    • Proximity Measure
  • Results
  • Results Analysis
  • Building an Evaluation Corpus
• Clustering Answers
  • Testing Different Similarity Metrics
    • Clusterer Metrics
    • Results
    • Results Analysis
  • New Distance Metric - JFKDistance
    • Description
    • Results
    • Results Analysis
  • Possible new features – Combination of distance metrics to enhance overall results
  • Posterior analysis
• Conclusions and Future Work

In this particular case the original reference 'Sundance Festival' is actually lost, along with vital historical context (which Sundance festival? Is it the 21st? The 25th?). Another blatant example is the use of other complex anaphors such as 'the company' instead of simply 'it' in the question.

We knew that this corpus was grouped by topics and was therefore composed of a set of questions for each topic. Noticing that each of these questions from the TREC corpus possessed topic information that many times was not explicit within the question itself, the best possible solution was to somehow pass this information to the query at the Passage Retrieval module. By manually inserting this topic information into the questions from TREC 2004 to 2007 (the 2003 questions were independent from each other) in the best manner, and after noticing a few mistakes in the application that filters the questions that have a correct answer from New York Times, we reached a total of 1309 questions. However, due to a length barrier in JustAsk, we reduced this set to 1210 questions.

New York Times corpus

Twenty years of New York Times editions were indexed using Lucene Search as a tool. Instead of sentences or paragraphs as relevant passages, as happened with the Google and Yahoo search methods, we started by returning the whole article as a passage. Results were far from being as good as with the Google corpus.

The problem here might lie in a clear lack of redundancy – candidate answers do not appear as often, in several variations, as they do in Google.

We also questioned whether the big problem lies within the passage retrieval phase, since we are now returning a whole document instead of a single paragraph/phrase, as we were with the Google corpus. Imagine the system retrieving a document for the question 'What is the capital of Iraq?' that happens to be a political text mentioning the conflict between the country and the United States, and ends up mentioning Washington more times over the document than any Iraqi city, including the country's capital.

With this in mind we tried another approach, in which the indexed passages are not whole documents but only the paragraph that contains the keywords we send in the query. This change improved the results, but unfortunately not as much as we could hope for, as we will see over the next section.
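A minimal sketch of this paragraph-level strategy is shown below: it keeps, for indexing, only the paragraphs of an article that contain at least one of the query keywords. The class and method names are assumptions made for illustration and are independent of the actual Lucene indexing code.

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch: select only the paragraphs of an article that mention a query keyword.
public final class ParagraphSelector {

    public static List<String> relevantParagraphs(String article, List<String> keywords) {
        List<String> selected = new ArrayList<>();
        // Paragraphs are assumed to be separated by blank lines.
        for (String paragraph : article.split("\\n\\s*\\n")) {
            String lower = paragraph.toLowerCase(Locale.ROOT);
            for (String keyword : keywords) {
                if (lower.contains(keyword.toLowerCase(Locale.ROOT))) {
                    selected.add(paragraph.trim());
                    break;
                }
            }
        }
        return selected;
    }
}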

3.3 Baseline

3.3.1 Results with Multisix Corpus

Results for the overall system before this work are described in Table 1. Precision1 indicates the precision results for the top answer returned (out of the three possible final answers JustAsk returns). As we can see, out of 200 questions JustAsk attempted to answer 175 questions, 50.3% of the times correctly. Moreover, looking at the results of both Precision and Precision1, we know that JustAsk managed to push the correct answer to the top approximately 70.4% of the time.

Questions   Correct   Wrong   No Answer
200         88        87      25

Accuracy   Precision   Recall   MRR    Precision1
44.0%      50.3%       87.5%    0.37   35.4%

Table 1: JustAsk baseline best results for the test corpus of 200 questions.
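For clarity, these values follow directly from the counts in Table 1 (the arithmetic below is ours, rounded to one decimal place):

Precision = 88 / 175 ≈ 50.3%        Recall = 175 / 200 = 87.5%        Accuracy = 88 / 200 = 44.0%

and the 70.4% figure mentioned above is simply the ratio Precision1 / Precision = 35.4 / 50.3 ≈ 0.704.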

In Tables 2 and 3 we show the results of the Answer Extraction module using the Levenshtein distance with a threshold of 0.33 to calculate the distance between candidate answers. For the different amounts of retrieved passages (8, 32 or 64 passages) from Google, we present how successful the Candidate Answer Extraction (CAE) and Answer Selection (AS) stages were. The success of the CAE stage is measured by the number of questions that manage to extract at least one correct answer, while the success of the AS stage is given by the number of questions that manage to extract one final answer from the list of candidate answers. The tables also present a distribution of the number of questions with a certain amount of extracted answers.

                    Google – 8 passages             Google – 32 passages
Extracted answers   Questions   CAE       AS        Questions   CAE       AS
(#)                 (#)         success   success   (#)         success   success
0                   21          0         –         19          0         –
1 to 10             73          62        53        16          11        11
11 to 20            32          29        22        13          9         7
21 to 40            11          10        8         61          57        39
41 to 60            1           1         1         30          25        17
61 to 100           0           –         –         18          14        11
> 100               0           –         –         5           4         3
All                 138         102       84        162         120       88

Table 2: Answer extraction intrinsic evaluation, using Google and two different amounts of retrieved passages.

We can see that this module achieves the best results with 64 retrieved passages. If we reduce the number of passages, the number of extracted candidates per question also diminishes. But if the number of passages is higher, we also risk losing precision, having far too many wrong answers in the mix. Moreover, the AS and CAE stages are 100% successful with 11 to 20 answers extracted. When that number increases, the accuracies do not show any concrete tendency to increase or decrease.

In Table 4 we show the evaluation results of the answer extraction component of JustAsk according to the question category. The column 'Total' gives us the total number of questions per category, with 'Q' being the number of questions for which at least one candidate answer was extracted. The success of the CAE and AS stages is defined in the same way as in the previous tables.

With the category Entity, the system only extracts one final answer, no matter how many passages are used.

                    Google – 64 passages
Extracted answers   Questions   CAE       AS
(#)                 (#)         success   success
0                   17          0         0
1 to 10             18          9         9
11 to 20            2           2         2
21 to 40            16          12        8
41 to 60            42          38        25
61 to 100           38          32        18
> 100               35          30        17
All                 168         123       79

Table 3: Answer extraction intrinsic evaluation, using 64 passages from the search engine Google.

             Google – 8 passages    Google – 32 passages   Google – 64 passages
     Total   Q     CAE   AS         Q     CAE   AS         Q     CAE   AS
abb  4       3     1     1          3     1     1          4     1     1
des  9       6     4     4          6     4     4          6     4     4
ent  21      15    1     1          16    1     0          17    1     0
hum  64      46    42    34         55    48    36         56    48    31
loc  39      27    24    19         34    27    22         35    29    21
num  63      41    30    25         48    35    25         50    40    22
All  200     138   102   84         162   120   88         168   123   79

Table 4: Answer extraction intrinsic evaluation for different amounts of retrieved passages, according to the question category.

We found out that this category achieves the best result with only 8 passages retrieved from Google. Using 64 passages increases the number of questions, but at the same time gives worse results in the AS stage for three categories (Human, Location and Numeric) compared with the results obtained with 32 passages. This table also evidences the discrepancies between the different stages for all of these categories, showing that more effort should be put into a certain stage for a certain category – such as giving more focus to Answer Selection for the Numeric category, for instance.

3.3.2 Results with TREC Corpus

For the set of 1134 questions, the system answered 234 questions correctly (for a 20.6% accuracy), using the Levenshtein metric with a threshold of 0.17 and 20 paragraphs retrieved from New York Times, which represents an increase of approximately 32.2% (and 57 more correct answers than we had) over the previous approach of using whole documents as passages. This paragraph indexing method proved to be much better overall than document indexing and was therefore used permanently while searching for even better results.

The set of 1210 questions proved at least to have higher accuracy and recall scores than the previous sets of questions, as we can see from Table 5 below.

            1134 questions   232 questions   1210 questions
Precision   26.7%            28.2%           27.4%
Recall      77.2%            67.2%           76.9%
Accuracy    20.6%            19.0%           21.1%

Table 5: Baseline results for the three question sets created using TREC questions and the New York Times corpus.

Later we made further experiments with this corpus by increasing the number of passages to retrieve. We found out that this change actually had a considerable impact on the system's extrinsic results, as Table 6 shows.

Passages    20      30      50      100
Accuracy    21.1%   22.4%   23.7%   23.8%

Table 6: Accuracy results after the implementation of the features described in Sections 4 and 5, for 20, 30, 50 and 100 passages extracted from the NYT corpus.

Even though 100 passages offer the best results, they also take more than three times longer to process than just 20, for a 2.7% difference in accuracy. So, depending on whether we prefer time over accuracy, any of these settings is viable, with 30 passages offering the best middle ground here.

3.4 Error Analysis over the Answer Module

This section reports some of the most common mistakes that JustAsk still makes in its Answer Module – mainly while clustering the answer candidates, failing to cluster equal or inclusive answers into the same cluster, but also at the answer extraction phase. These mistakes are directly linked to the distance algorithms we have at the moment (Levenshtein and Overlap metrics).

1. Wrong candidates extracted – for the question 'How long is the Okavango River?', for instance, there were two answers of the Numeric type extracted: '1000 miles' and '300 km', with the former being the correct answer. Unfortunately, since there are only two instances of these answers, they both share the same score of 10. Further analysis of the passages containing the respective answers reveals why:

– 'The Okavango River rises in the Angolan highlands and flows over 1000 miles...'

– 'Namibia has built a water canal, measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls – Okavango River – Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search than a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates in categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that performs similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens from the candidate answers, then it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e., common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than a single common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014; Levenshtein actually gives us a better score (0.59). With the new proposed metric, we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.


3. Places/dates/numbers that differ in specificity – some candidates in the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in point 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity, these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (30). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: where do we draw the limit for inclusive answers, and which answer should we consider as representative if we group these answers in a single cluster?

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – many entities have completely different names for the same referent: we can find the name in the original language of the entity's location, or in a normalized English form. Even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' vs. 'GROZNY'

Solution: the answer should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend abbreviation normalization to include acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we solved each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that compose our 'relations module', new normalizers, and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, to value relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have made this process automatic and have extended the answering module, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: equivalence, inclusion, alternative and contradiction. For the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can improve our results considerably. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk that detects relations between answers and then uses them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to – and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes and the other for the answer that is included by the other).

For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all, for all the corpora.


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.
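A minimal sketch of this architecture is given below; the names mirror Figure 5 (Scorer, BaselineScorer, RulesScorer), but the signatures and the Answer/Relation types are assumptions made for illustration, not the actual JustAsk source.

import java.util.List;

// Minimal assumed data types for the sketch.
enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

record Relation(RelationType type, String otherAnswer) {}

record Answer(String text, List<Relation> relations) {}

// Contract assumed for both scoring components of Figure 5.
interface Scorer {
    double score(Answer candidate, List<Answer> allCandidates);
}

// Counts how many times the exact same answer string occurs (frequency).
class BaselineScorer implements Scorer {
    @Override
    public double score(Answer candidate, List<Answer> allCandidates) {
        double score = 0.0;
        for (Answer other : allCandidates) {
            if (other != candidate && other.text().equalsIgnoreCase(candidate.text())) {
                score += 1.0;   // frequency weight
            }
        }
        return score;
    }
}

// Adds the weights of the relations the candidate shares with other answers.
class RulesScorer implements Scorer {
    private static final double EQUIVALENCE_WEIGHT = 1.0;
    private static final double INCLUSION_WEIGHT = 0.5;

    @Override
    public double score(Answer candidate, List<Answer> allCandidates) {
        double score = 0.0;
        for (Relation relation : candidate.relations()) {
            switch (relation.type()) {
                case EQUIVALENCE: score += EQUIVALENCE_WEIGHT; break;
                case INCLUDES:
                case INCLUDED_BY: score += INCLUSION_WEIGHT; break;
                default: break;
            }
        }
        return score;
    }
}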


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created, dealing with answers of the types NumericDistance, NumericSize, NumericWeight and NumericTemperature, that is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
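A sketch of such a conversion normalizer follows (the conversion factors are the standard ones; the class name and structure are assumptions made for illustration, not the actual JustAsk normalizer):

// Illustrative unit-conversion normalizer: converts a (value, unit) pair to a canonical unit.
public final class UnitNormalizer {

    public static String normalize(double value, String unit) {
        switch (unit.toLowerCase()) {
            case "km":
            case "kilometres":
            case "kilometers":
                return String.format("%.2f miles", value * 0.621371);
            case "m":
            case "meters":
            case "metres":
                return String.format("%.2f feet", value * 3.28084);
            case "kg":
            case "kilograms":
                return String.format("%.2f pounds", value * 2.20462);
            case "c":
            case "celsius":
                return String.format("%.2f fahrenheit", value * 9.0 / 5.0 + 32.0);
            default:
                return value + " " + unit;   // unknown unit: leave untouched
        }
    }
}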

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as LocationCountry and HumanGroup. As with the normalizers above, it did not produce any noticeable changes in the system results, but it only evaluates very particular cases for the moment, allowing room for further extensions.

4.2.2 Proximity Measure

As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
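A sketch of this proximity boost follows; the 20-character threshold and the 5-point boost are the values mentioned above, while the class and method names are illustrative assumptions:

import java.util.List;

// Illustrative sketch of the proximity boost: answers whose average character distance
// to the query keywords (within the passage) is below a threshold get extra score.
public final class ProximityBooster {

    private static final int DISTANCE_THRESHOLD = 20;  // characters
    private static final double BOOST = 5.0;           // score points

    public static double boost(String passage, String answer, List<String> keywords, double currentScore) {
        String lower = passage.toLowerCase();
        int answerPos = lower.indexOf(answer.toLowerCase());
        if (answerPos < 0 || keywords.isEmpty()) {
            return currentScore;
        }
        double total = 0.0;
        int found = 0;
        for (String keyword : keywords) {
            int keywordPos = lower.indexOf(keyword.toLowerCase());
            if (keywordPos >= 0) {
                total += Math.abs(answerPos - keywordPos);
                found++;
            }
        }
        if (found > 0 && (total / found) < DISTANCE_THRESHOLD) {
            return currentScore + BOOST;
        }
        return currentScore;
    }
}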


4.3 Results

                   200 questions              1210 questions
Before features    P 50.6  R 87.0  A 44.0     P 28.0  R 79.5  A 22.2
After features     P 51.1  R 87.0  A 44.5     P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered for the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in that case a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create these corpora containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt about the relation type.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations' corpora in the command line.

Over the next chapter we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to run new experiments by bringing new similarity metrics into the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that had been used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – uk.ac.shef.wit.simmetrics.similaritymetrics:

– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

J(A,B) = |A ∩ B| / |A ∪ B|    (15)

In other words, the idea is to divide the number of elements (characters or n-grams) the two sequences share by the total number of distinct elements occurring in either of them.

– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using twice the number of elements that the two sequences share in common. With A and B as two sets, Dice's coefficient is given by the following formula:

Dice(A,B) = 2|A ∩ B| / (|A| + |B|)    (16)

Dice's coefficient therefore gives double weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets has five elements: {Ke, en, ne, ed, dy}. The result would be (2 · 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71, since the union of the two sets has seven elements (a small code sketch after this list of metrics illustrates the bigram computation).


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s1, s2) = 1/3 · (m/|s1| + m/|s2| + (m − t)/m)    (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric, given by:

dw = dj + l · p · (1 − dj)    (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance. In other words, it will give a much better score to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta', than Jaro.

ndash Needleman-Wunsch [30] ndash dynamic programming algorithm similar toLevenshtein distance but adding a variable cost G (instead of the constant1 for Levenshtein)

D(i j) =

D(iminus 1 j minus 1) +GD(iminus 1 j) +GD(i j minus 1) +G

(19)

The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, which is more focused on matching words that sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is above the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better for answers with common affixes, such as 'acting' and 'enact'.


Then the LogSim metric by Cordeiro and his team, described in the related work section, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much the threshold affects the performance of each metric.
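As a concrete illustration of the n-gram-based measures listed above, the sketch below computes the bigram sets of two strings and derives both the Dice coefficient and the Jaccard index from them (our own illustrative code, not the SimMetrics implementation); run on 'Kennedy' and 'Keneddy' it reproduces the 10/12 and 5/7 values from the earlier example.

import java.util.HashSet;
import java.util.Set;

// Illustrative bigram-based Dice and Jaccard similarities (not the SimMetrics code).
public final class BigramSimilarity {

    private static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    public static double dice(String a, String b) {
        Set<String> ga = bigrams(a);
        Set<String> gb = bigrams(b);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return (2.0 * common.size()) / (ga.size() + gb.size());
    }

    public static double jaccard(String a, String b) {
        Set<String> ga = bigrams(a);
        Set<String> gb = bigrams(b);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        Set<String> union = new HashSet<>(ga);
        union.addAll(gb);
        return union.isEmpty() ? 1.0 : (double) common.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));     // 10/12, approx. 0.83
        System.out.println(jaccard("Kennedy", "Keneddy"));  // 5/7,  approx. 0.71
    }
}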

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold   0.17                     0.33                     0.5
Levenshtein          P 51.1 R 87.0 A 44.5     P 43.7 R 87.5 A 38.5     P 37.3 R 84.5 A 31.5
Overlap              P 37.9 R 87.0 A 33.0     P 33.0 R 87.0 A 33.0     P 36.4 R 86.5 A 31.5
Jaccard index        P  9.8 R 56.0 A  5.5     P  9.9 R 55.5 A  5.5     P  9.9 R 55.5 A  5.5
Dice Coefficient     P  9.8 R 56.0 A  5.5     P  9.9 R 56.0 A  5.5     P  9.9 R 55.5 A  5.5
Jaro                 P 11.0 R 63.5 A  7.0     P 11.9 R 63.0 A  7.5     P  9.8 R 56.0 A  5.5
Jaro-Winkler         P 11.0 R 63.5 A  7.0     P 11.9 R 63.0 A  7.5     P  9.8 R 56.0 A  5.5
Needleman-Wunsch     P 43.7 R 87.0 A 38.0     P 43.7 R 87.0 A 38.0     P 16.9 R 59.0 A 10.0
LogSim               P 43.7 R 87.0 A 38.0     P 43.7 R 87.0 A 38.0     P 42.0 R 87.0 A 36.5
Soundex              P 35.9 R 83.5 A 30.0     P 35.9 R 83.5 A 30.0     P 28.7 R 82.0 A 23.5
Smith-Waterman       P 17.3 R 63.5 A 11.0     P 13.4 R 59.5 A  8.0     P 10.4 R 57.5 A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, while adding an extra weight for strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to that of a match between two tokens, for that extra weight.

The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))    (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2    (22)

with a1 and a2 being the two answers converted to strings after normalization, firstLetterMatches the number of unmatched tokens whose first letter matches that of an unmatched token of the other answer, and nonCommonTerms the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
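A direct reading of equations (20)-(22) gives the sketch below (our illustrative version; the tokenization and the normalization details of the real implementation are assumed or omitted):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative implementation of the JFKDistance defined in equations (20)-(22).
public final class JFKDistance {

    public static double distance(String a1, String a2) {
        List<String> t1 = new ArrayList<>(Arrays.asList(a1.trim().split("\\s+")));
        List<String> t2 = new ArrayList<>(Arrays.asList(a2.trim().split("\\s+")));
        int maxLength = Math.max(t1.size(), t2.size());

        // Count and remove the common (case-insensitive) terms.
        int commonTerms = 0;
        for (String token : new ArrayList<>(t1)) {
            int match = indexOfIgnoreCase(t2, token);
            if (match >= 0) {
                commonTerms++;
                t1.remove(token);
                t2.remove(match);
            }
        }

        // Among the remaining tokens, count pairs sharing the same first letter.
        int nonCommonTerms = t1.size() + t2.size();
        int firstLetterMatches = 0;
        for (String token : t1) {
            for (int j = 0; j < t2.size(); j++) {
                if (Character.toLowerCase(token.charAt(0)) == Character.toLowerCase(t2.get(j).charAt(0))) {
                    firstLetterMatches++;
                    t2.remove(j);
                    break;
                }
            }
        }

        double commonTermsScore = (double) commonTerms / maxLength;
        double firstLetterMatchesScore = nonCommonTerms == 0
                ? 0.0
                : ((double) firstLetterMatches / nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    private static int indexOfIgnoreCase(List<String> tokens, String target) {
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equalsIgnoreCase(target)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // 1 common term out of 3, 2 first-letter matches among 4 unmatched tokens:
        // distance = 1 - 1/3 - (2/4)/2 = 0.416...
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}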

A few flaws about this measure were thought later on namely the scopeof this measure whithin our topology of answers if for example we applythis same measure to answers of type Numeric results could become quiteunpredictable and wrong It became quite clear that a separation of measuresgiven a type of answer (Numeric Entity Location etc) was needed if wewanted to use this measure to its full potential as we will see

48

522 Results

Table 9 shows us the results after the implementation of this new distancemetric

200 questions 1210 questionsBefore P511 R870

A445P282 R798A225

After new metric P517 R870A450

P292 R798A233

Table 9 Precision (P) Recall (R) and Accuracy (A) Results after the imple-mentation of the new distance metric Using 30 passages from New York Timesfor the corpus of 1210 questions from TREC

523 Results Analysis

As we can see for these results even with the redundancy of sharing featuresthat we also have in the rules and in certain normalizers this metric can actu-ally improve our results even if the difference is minimal and not statisticallysignificant

53 Possible new features ndash Combination of distance met-rics to enhance overall results

We also thought in analyzing these results by the question category using anevaluation framework developed earlier The motivation behind this evaluationwas to build a decision tree of sorts that would basically choose a metric (or aconjunction of metrics) for a certain category (or a group of categories) If forinstance we detect that the Levenshtein distance gives us the best answers forthe category Human but for the category Location LogSim was the metricwhich offered us the best results we would have something like

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)
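
As a complement to the pseudocode above, here is a small, hypothetical sketch of how such a per-category selection could be wired. The class names are illustrative and the registered entries reuse the JFKDistance sketch from the previous section as placeholders; in the real system each category would be mapped to its best-performing metric (e.g. LogSim for Location).

import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Illustrative per-category metric selector (hypothetical names, not the actual
// JustAsk code): each question category is mapped to the distance function that
// performed best for it, with a fallback for unknown categories.
public class CategoryMetricSelector {

    private final Map<String, BiFunction<String, String, Double>> byCategory = new HashMap<>();
    private final BiFunction<String, String, Double> fallback;

    public CategoryMetricSelector(BiFunction<String, String, Double> fallback) {
        this.fallback = fallback;
    }

    public void register(String category, BiFunction<String, String, Double> metric) {
        byCategory.put(category, metric);
    }

    public double distance(String category, String a1, String a2) {
        return byCategory.getOrDefault(category, fallback).apply(a1, a2);
    }

    public static void main(String[] args) {
        CategoryMetricSelector selector = new CategoryMetricSelector(JFKDistance::distance);
        selector.register("Human", JFKDistance::distance);
        selector.register("Location", JFKDistance::distance); // placeholder where LogSim would go
        System.out.println(selector.distance("Human", "J F Kennedy", "John Fitzgerald Kennedy"));
    }
}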

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis we are looking for that much), here are the results for every metric incorporated, for the three main categories, Human, Location and Numeric, that can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric \ Category     Human    Location    Numeric
Levenshtein           50.00    53.85       4.84
Overlap               42.42    41.03       4.84
Jaccard index          7.58     0.00       0.00
Dice                   7.58     0.00       0.00
Jaro                   9.09     0.00       0.00
Jaro-Winkler           9.09     0.00       0.00
Needleman-Wunch       42.42    46.15       4.84
LogSim                42.42    46.15       4.84
Soundex               42.42    46.15       3.23
Smith-Waterman         9.09     2.56       0.00
JFKDistance           54.55    53.85       4.84

Table 10: Accuracy results (in %) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except for when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since we find out that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

54 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – Some questions do not have a single, stable answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).

– Numeric answers are easily failable – when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type NumericDate, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for a) ignoring 'one' to 'three' when written in words, b) excluding the publishing date from the snippet somehow, and c) improving the regular expressions used to filter candidates.

– Filtering stage does not seem to work in certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After filling in missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved

– We performed a state-of-the-art survey on relating answers and paraphrase identification, in order to detect new possible ways to improve JustAsk's results through the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

ndash We developed new metrics to improve the clustering of answers

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

ndash We performed further analysis to detect persistent error patterns

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric in their own right.

– Finally, implementing solutions to the problems and questions left unanswered in Section 54 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine De Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Languages, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2 and considering

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th character of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is the hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

11 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
– Example: 'M01 Y2001' includes 'D01 M01 Y2001'

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
– Example: 'Y2001' includes 'D01 M01 Y2001'
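
A minimal sketch of these two clauses follows (token-level containment is assumed as the meaning of contains(); the class and method names are illustrative, not the actual JustAsk code):

import java.util.Arrays;
import java.util.List;

// Sketch of the Date inclusion rules above: Ai includes Aj when Aj contains all
// of Ai's tokens and has one or two extra tokens.
public class DateRules {

    public static boolean includes(String ai, String aj) {
        List<String> aiTokens = Arrays.asList(ai.trim().split("\\s+"));
        List<String> ajTokens = Arrays.asList(aj.trim().split("\\s+"));
        int extraTokens = ajTokens.size() - aiTokens.size();
        return ajTokens.containsAll(aiTokens) && (extraTokens == 1 || extraTokens == 2);
    }

    public static void main(String[] args) {
        System.out.println(includes("M01 Y2001", "D01 M01 Y2001")); // true
        System.out.println(includes("Y2001", "D01 M01 Y2001"));     // true
    }
}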



12 Questions of category Numeric

Given

• number = (d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX * max

• numXmin = numX * min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
– Example: '200 people' is equivalent to '200'

• matchesic(Ai, "()bnumber( )") and matchesic(Aj, "()bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
– Example: '210 people' is equivalent to '200 people'

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )")
– Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAi = numAj)
– Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
– Example: 'approximately 200 people' includes '210'


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bmorethan number( )") and (numAi < numAj)
– Example: 'more than 200' includes '300'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()blessthan number( )") and (numAi > numAj)
– Example: 'less than 300' includes '200'
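
A condensed sketch of the numeric tolerance logic above is given below (the regular expressions are simplified with respect to the originals, and the names are illustrative, not the actual JustAsk code):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the Numeric rules above: two numeric answers are equivalent when
// their values fall within the [min, max] tolerance band, and 'more than N' /
// 'less than N' answers include values beyond / below N.
public class NumericRules {

    private static final Pattern NUMBER = Pattern.compile("\\d+(\\.\\d+)?");
    private static final Pattern MORETHAN = Pattern.compile("(more than|over|at least)\\s+\\d+");
    private static final Pattern LESSTHAN = Pattern.compile("(less than|almost|up to|nearly)\\s+\\d+");
    private static final double MAX = 1.1; // tolerance factors, as defined above
    private static final double MIN = 0.9;

    private static Double numericValue(String answer) {
        Matcher m = NUMBER.matcher(answer);
        return m.find() ? Double.valueOf(m.group()) : null;
    }

    public static boolean equivalent(String ai, String aj) {
        Double numAi = numericValue(ai), numAj = numericValue(aj);
        if (numAi == null || numAj == null) return false;
        return numAi * MAX > numAj && numAi * MIN < numAj;
    }

    public static boolean includes(String ai, String aj) {
        Double numAi = numericValue(ai), numAj = numericValue(aj);
        if (numAi == null || numAj == null) return false;
        if (MORETHAN.matcher(ai.toLowerCase()).find()) return numAi < numAj;
        if (LESSTHAN.matcher(ai.toLowerCase()).find()) return numAi > numAj;
        return false;
    }

    public static void main(String[] args) {
        System.out.println(equivalent("210 people", "200 people")); // true
        System.out.println(includes("more than 200", "300"));       // true
        System.out.println(includes("less than 300", "200"));       // true
    }
}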

13 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
– Example: 'John F Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
– Example: 'Mr Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
– Example: 'John Fitzgerald Kennedy' is equivalent to 'John F Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
– Example: 'John F Kennedy' is equivalent to 'F Kennedy'

131 Questions of category HumanIndividual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
– Example: 'Mr Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)
– Example:

• levenshteinDistance(Ai, Aj) < 1
– Example:
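
A small sketch covering a few of the clauses above follows (only one direction of each clause is checked, and the names are illustrative, not the actual JustAsk code):

// Sketch of some of the Human equivalence clauses above: equal first and last
// names with an optional middle token, middle tokens agreeing on their first
// letter, and title or partial forms of the same name.
public class HumanRules {

    public static boolean equivalent(String a1, String a2) {
        String[] ai = a1.trim().split("\\s+");
        String[] aj = a2.trim().split("\\s+");

        // 'John F Kennedy' is equivalent to 'John Kennedy' (middle token dropped).
        if (ai.length == 3 && aj.length == 2
                && ai[0].equalsIgnoreCase(aj[0]) && ai[2].equalsIgnoreCase(aj[1])) return true;

        // 'John Fitzgerald Kennedy' is equivalent to 'John F Kennedy'
        // (middle tokens share the first letter).
        if (ai.length == 3 && aj.length == 3
                && ai[0].equalsIgnoreCase(aj[0]) && ai[2].equalsIgnoreCase(aj[2])
                && Character.toLowerCase(ai[1].charAt(0)) == Character.toLowerCase(aj[1].charAt(0)))
            return true;

        // 'Mr Kennedy' is equivalent to 'Kennedy' (HumanIndividual clause).
        if (ai.length == 2 && aj.length == 1 && ai[1].equalsIgnoreCase(aj[0])) return true;

        return false;
    }

    public static void main(String[] args) {
        System.out.println(equivalent("John F Kennedy", "John Kennedy"));            // true
        System.out.println(equivalent("John Fitzgerald Kennedy", "John F Kennedy")); // true
        System.out.println(equivalent("Mr Kennedy", "Kennedy"));                     // true
    }
}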

14 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
– Example: 'Republic of China' is equivalent to 'China'


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
– Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "sins+Aj0")
– Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)s+(in|of)s+Aj0")
– Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)s+Aj0")
– Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
– Example: 'Florida' includes 'Miami, Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
– Example: 'California' includes 'Los Angeles, California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
– Example: 'California' includes 'East California'
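
Two of the clauses above can be sketched as follows (the regular expression is a simplified stand-in for the patterns listed, and the names are illustrative, not the actual JustAsk code):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of two Location clauses above: a 'republic/city/kingdom/island of X'
// style answer is equivalent to the bare name X, and a place includes any
// 'Y, Place' answer that ends with it after a comma.
public class LocationRules {

    private static final Pattern PREFIX =
            Pattern.compile("(?i).*(republic|city|kingdom|island)\\s+(of|in)\\s+(.+)");

    public static boolean equivalent(String ai, String aj) {
        Matcher m = PREFIX.matcher(ai.trim());
        return m.matches() && m.group(3).equalsIgnoreCase(aj.trim());
    }

    public static boolean includes(String ai, String aj) {
        String[] parts = aj.split(",");
        return parts.length == 2 && parts[1].trim().equalsIgnoreCase(ai.trim());
    }

    public static void main(String[] args) {
        System.out.println(equivalent("Republic of China", "China"));               // true
        System.out.println(equivalent("in the lovely Republic of China", "China")); // true
        System.out.println(includes("Florida", "Miami, Florida"));                  // true
        System.out.println(includes("California", "Los Angeles, California"));      // true
    }
}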

15 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
– Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
– Example: 'animal' includes 'monkey'
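
The Entity clauses depend on lemmatization and hypernymy lookups; the sketch below stubs both (in the real system they would be backed by a lexical resource such as WordNet, and the names here are illustrative, not the actual JustAsk code):

// Sketch of the Entity clauses above, with lemma() and isHypernym() stubbed:
// a real implementation would query WordNet (or a similar resource) instead.
public class EntityRules {

    public static boolean equivalent(String ai, String aj) {
        return tokenCount(ai) <= 2 && tokenCount(aj) <= 2
                && lemma(ai).equalsIgnoreCase(lemma(aj));
    }

    public static boolean includes(String ai, String aj) {
        return tokenCount(ai) <= 2 && tokenCount(aj) <= 2 && isHypernym(ai, aj);
    }

    private static int tokenCount(String s) {
        return s.trim().split("\\s+").length;
    }

    // Stub: a real implementation would return the WordNet lemma.
    private static String lemma(String s) {
        return s.trim().toLowerCase();
    }

    // Stub: a real implementation would walk WordNet hypernym chains.
    private static boolean isHypernym(String x, String y) {
        return x.equalsIgnoreCase("animal") && y.equalsIgnoreCase("monkey");
    }

    public static void main(String[] args) {
        System.out.println(includes("animal", "monkey")); // true (via the stub)
    }
}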

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
order to improve the overall accuracy of JustAsk This measure is in a waya progression from metrics such as Overlap Distance and LogSim also using atokenizer to break the strings into a set of tokens and then taking the basicconcept of common terms (we can also call it links words or tokens) betweenthe two strings and also adding an extra weight to strings that not only shareat least a perfect match but have also another token that shares the first letterwith the other token from the other answer To exemplify consider the exampleabove We have a perfect match in the common term Kennedy and then wehave two tokens unmatched Even the apparently intuitive Overlap distancewill give us a value of 23 making it impossible to cluster these two answersLevenshtein will be closer to 05 Still underwhelming Since we already haveone word in common with the new metric we can assume with some securitythat there is a relation between lsquoJohnrsquo and lsquoJrsquo and lsquoFitzgeraldrsquo and lsquoFrsquo Thisweight will obviously not be as big as the score we give to a lsquoperfectrsquo match butit should help us clustering more of these types of answers For that matter wedivide by 2 what would be a value identical to a match between two tokens forthat extra weight

The new distance is computed the following way

distance = 1minus commonTermsScoreminus firstLetterMatchesScore (20)

in which the scores for the common terms and matches of the first letter ofeach token are given by

commonTermsScore =commonTerms

max(length(a1) length(a2)) (21)

and

firstLetterMatchesScore =firstLetterMatchesnonCommonTerms

2(22)

with a1 and a2 being the two answers converted to strings after normaliza-tion and nonCommonTerms being the terms that remain unmatched such aslsquoJrsquo lsquoJohnrsquo lsquoFrsquo andlsquoFitzgeraldrsquo in the above example

A few flaws about this measure were thought later on namely the scopeof this measure whithin our topology of answers if for example we applythis same measure to answers of type Numeric results could become quiteunpredictable and wrong It became quite clear that a separation of measuresgiven a type of answer (Numeric Entity Location etc) was needed if wewanted to use this measure to its full potential as we will see

48

522 Results

Table 9 shows us the results after the implementation of this new distancemetric

200 questions 1210 questionsBefore P511 R870

A445P282 R798A225

After new metric P517 R870A450

P292 R798A233

Table 9 Precision (P) Recall (R) and Accuracy (A) Results after the imple-mentation of the new distance metric Using 30 passages from New York Timesfor the corpus of 1210 questions from TREC

523 Results Analysis

As we can see for these results even with the redundancy of sharing featuresthat we also have in the rules and in certain normalizers this metric can actu-ally improve our results even if the difference is minimal and not statisticallysignificant

53 Possible new features ndash Combination of distance met-rics to enhance overall results

We also thought in analyzing these results by the question category using anevaluation framework developed earlier The motivation behind this evaluationwas to build a decision tree of sorts that would basically choose a metric (or aconjunction of metrics) for a certain category (or a group of categories) If forinstance we detect that the Levenshtein distance gives us the best answers forthe category Human but for the category Location LogSim was the metricwhich offered us the best results we would have something like

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

()

For that effort we used the best thresholds for each metric using the corpusof 200 questions and Google search engine retrieving 32 passages Our questionclassifier achieves an accuracy of 90 [39] and there have been a few answersnot labeled as correct by the framework that are indeed correct With that inmind and considering that these are issues that are common to every singledistance metric used (so they should not affect that much the analysis we are

49

looking for) here are the results for every metric incorporated by the threemain categories Human Location and Numeric that can be easily evaluated(categories such as Description by retrieving the Wikipedia abstract are justmarked as wrongnil given the reference input) shown in Table 10

Metric / Category     Human    Location   Numeric
Levenshtein           50.00    53.85       4.84
Overlap               42.42    41.03       4.84
Jaccard index          7.58     0.00       0.00
Dice                   7.58     0.00       0.00
Jaro                   9.09     0.00       0.00
Jaro-Winkler           9.09     0.00       0.00
Needleman-Wunsch      42.42    46.15       4.84
LogSim                42.42    46.15       4.84
Soundex               42.42    46.15       3.23
Smith-Waterman         9.09     2.56       0.00
JFKDistance           54.55    53.85       4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system bolded, except for when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways through each category, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since we find out that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – Some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, who should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are easily failable, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for a) ignoring 'one' to 'three' when written in words, b) excluding the publishing date from the snippet somehow, and c) improving the regular expressions used to filter candidates.

– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers (a sketch of such a filter is given at the end of this section).

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
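
As referenced in the filtering point above, a rule that drops candidates which are mere variations of the question could look like the following sketch; the normalization steps and names are illustrative assumptions, not the actual JustAsk implementation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of a filter that discards candidate answers whose tokens
// all occur in the question, e.g. dropping 'John Famalaro' for the question
// 'Who is John J. Famalaro accused of having killed?'.
public class QuestionVariationFilter {

    static String normalize(String s) {
        return s.toLowerCase().replaceAll("[^a-z0-9 ]", " ").replaceAll("\\s+", " ").trim();
    }

    static List<String> filter(String question, List<String> candidates) {
        List<String> kept = new ArrayList<>();
        List<String> questionTokens = Arrays.asList(normalize(question).split(" "));
        for (String candidate : candidates) {
            boolean allInQuestion = true;
            for (String token : normalize(candidate).split(" "))
                if (!questionTokens.contains(token)) { allInQuestion = false; break; }
            if (!allInQuestion) kept.add(candidate);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter("Who is John J. Famalaro accused of having killed?",
                Arrays.asList("John Famalaro", "Denise Huber")));
        // prints [Denise Huber]
    }
}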


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved

– We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870–4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence, pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2 and considering

• Ai,j – the jth token of candidate answer Ai
• Ai,j,k – the kth char of the jth token of candidate answer Ai
• lemma(X) – the lemma of X
• ishypernym(X, Y) – returns true if X is the hypernym of Y
• length(X) – the number of tokens of X
• contains(X, Y) – returns true if X contains Y
• containsic(X, Y) – returns true if X contains Y, case insensitive
• equals(X, Y) – returns true if X is lexicographically equal to Y
• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive
• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001
• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001



1.2 Questions of category Numeric

Given

• number = (\d+([0-9,.E])*)

• lessthan = ((less than)|(almost)|(up to)|(nearly))
• morethan = ((more than)|(over)|(at least))
• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)
• min = 0.9 (or another value)
• numX – the numeric value in X
• numXmax = numX × max
• numXmin = numX × min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200
• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
  – Example: approximately 200 people includes 210
• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  – Example: approximately 200 people includes 200
• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  – Example: more than 200 includes 300
• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  – Example: less than 300 includes 200
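
To illustrate how rules of this kind translate into code, here is a minimal sketch of the interval-based equivalence rule above ('210 people is equivalent to 200 people'); the class name, the regular expression and the parsing details are assumptions, not the actual JustAsk implementation.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the Numeric equivalence rule above (not JustAsk code):
// two numeric answers are equivalent when each number falls within +/-10% of
// the other, using the max = 1.1 and min = 0.9 values given.
public class NumericEquivalenceRule {
    private static final double MAX = 1.1;
    private static final double MIN = 0.9;
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:[0-9,.E])*");

    // numX: the numeric value found in answer X, or null if there is none.
    static Double num(String answer) {
        Matcher m = NUMBER.matcher(answer);
        if (!m.find()) return null;
        try {
            return Double.parseDouble(m.group().replace(",", ""));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Ai is equivalent to Aj if numAimax > numAj and numAimin < numAj.
    static boolean equivalent(String ai, String aj) {
        Double numAi = num(ai), numAj = num(aj);
        if (numAi == null || numAj == null) return false;
        return numAi * MAX > numAj && numAi * MIN < numAj;
    }

    public static void main(String[] args) {
        System.out.println(equivalent("210 people", "200 people")); // true
        System.out.println(equivalent("300", "200 people"));        // false
    }
}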

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to John Kennedy
• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr. Kennedy is equivalent to John Kennedy
• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy
• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to F. Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr. Kennedy is equivalent to Kennedy
• containsic(Ai, Aj)
  – Example:
• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:
• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: in China is equivalent to China
• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China
• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida
• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California
• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  – Example: California includes East California
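
The first two inclusion rules above can be generalized into a short check that a one-token location is the final token of a longer candidate, preceded by a comma. The sketch below is illustrative only; its tokenization and the generalization to any candidate of three or more tokens are assumptions.

import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the Location inclusion rules above (not JustAsk code):
// e.g. "Florida" includes "Miami, Florida".
public class LocationInclusionRule {

    static List<String> tokenize(String answer) {
        // Keep the comma as its own token, as assumed by the rules above.
        return Arrays.asList(answer.replace(",", " , ").trim().split("\\s+"));
    }

    static boolean includes(String ai, String aj) {
        List<String> a1 = tokenize(ai), a2 = tokenize(aj);
        if (a1.size() != 1 || a2.size() < 3) return false;
        // Last token of Aj equals Ai, and the token before it is a comma.
        return a2.get(a2.size() - 1).equalsIgnoreCase(a1.get(0))
                && a2.get(a2.size() - 2).equals(",");
    }

    public static void main(String[] args) {
        System.out.println(includes("Florida", "Miami, Florida"));             // true
        System.out.println(includes("California", "Los Angeles, California")); // true
        System.out.println(includes("California", "East California"));         // false (third rule not sketched)
    }
}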

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey

• Introduction
• Related Work
  • Relating Answers
  • Paraphrase Identification
    • SimFinder – A methodology based on linguistic analysis of comparable corpora
    • Methodologies purely based on lexical similarity
    • Canonicalization as complement to lexical features
    • Use of context to extract similar phrases
    • Use of dissimilarities and semantics to provide better similarity measures
  • Overview
• JustAsk architecture and work points
  • JustAsk
    • Question Interpretation
    • Passage Retrieval
    • Answer Extraction
  • Experimental Setup
    • Evaluation measures
    • Multisix corpus
    • TREC corpus
  • Baseline
    • Results with Multisix Corpus
    • Results with TREC Corpus
  • Error Analysis over the Answer Module
• Relating Answers
  • Improving the Answer Extraction module: Relations Module Implementation
  • Additional Features
    • Normalizers
    • Proximity Measure
  • Results
  • Results Analysis
  • Building an Evaluation Corpus
• Clustering Answers
  • Testing Different Similarity Metrics
    • Clusterer Metrics
    • Results
    • Results Analysis
  • New Distance Metric – JFKDistance
    • Description
    • Results
    • Results Analysis
  • Possible new features – Combination of distance metrics to enhance overall results
  • Posterior analysis
• Conclusions and Future Work
order to improve the overall accuracy of JustAsk This measure is in a waya progression from metrics such as Overlap Distance and LogSim also using atokenizer to break the strings into a set of tokens and then taking the basicconcept of common terms (we can also call it links words or tokens) betweenthe two strings and also adding an extra weight to strings that not only shareat least a perfect match but have also another token that shares the first letterwith the other token from the other answer To exemplify consider the exampleabove We have a perfect match in the common term Kennedy and then wehave two tokens unmatched Even the apparently intuitive Overlap distancewill give us a value of 23 making it impossible to cluster these two answersLevenshtein will be closer to 05 Still underwhelming Since we already haveone word in common with the new metric we can assume with some securitythat there is a relation between lsquoJohnrsquo and lsquoJrsquo and lsquoFitzgeraldrsquo and lsquoFrsquo Thisweight will obviously not be as big as the score we give to a lsquoperfectrsquo match butit should help us clustering more of these types of answers For that matter wedivide by 2 what would be a value identical to a match between two tokens forthat extra weight

The new distance is computed the following way

distance = 1minus commonTermsScoreminus firstLetterMatchesScore (20)

in which the scores for the common terms and matches of the first letter ofeach token are given by

commonTermsScore =commonTerms

max(length(a1) length(a2)) (21)

and

firstLetterMatchesScore =firstLetterMatchesnonCommonTerms

2(22)

with a1 and a2 being the two answers converted to strings after normaliza-tion and nonCommonTerms being the terms that remain unmatched such aslsquoJrsquo lsquoJohnrsquo lsquoFrsquo andlsquoFitzgeraldrsquo in the above example

A few flaws about this measure were thought later on namely the scopeof this measure whithin our topology of answers if for example we applythis same measure to answers of type Numeric results could become quiteunpredictable and wrong It became quite clear that a separation of measuresgiven a type of answer (Numeric Entity Location etc) was needed if wewanted to use this measure to its full potential as we will see

48

522 Results

Table 9 shows us the results after the implementation of this new distancemetric

200 questions 1210 questionsBefore P511 R870

A445P282 R798A225

After new metric P517 R870A450

P292 R798A233

Table 9 Precision (P) Recall (R) and Accuracy (A) Results after the imple-mentation of the new distance metric Using 30 passages from New York Timesfor the corpus of 1210 questions from TREC

523 Results Analysis

As we can see for these results even with the redundancy of sharing featuresthat we also have in the rules and in certain normalizers this metric can actu-ally improve our results even if the difference is minimal and not statisticallysignificant

53 Possible new features ndash Combination of distance met-rics to enhance overall results

We also thought in analyzing these results by the question category using anevaluation framework developed earlier The motivation behind this evaluationwas to build a decision tree of sorts that would basically choose a metric (or aconjunction of metrics) for a certain category (or a group of categories) If forinstance we detect that the Levenshtein distance gives us the best answers forthe category Human but for the category Location LogSim was the metricwhich offered us the best results we would have something like

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

()

For that effort we used the best thresholds for each metric using the corpusof 200 questions and Google search engine retrieving 32 passages Our questionclassifier achieves an accuracy of 90 [39] and there have been a few answersnot labeled as correct by the framework that are indeed correct With that inmind and considering that these are issues that are common to every singledistance metric used (so they should not affect that much the analysis we are

49

looking for) here are the results for every metric incorporated by the threemain categories Human Location and Numeric that can be easily evaluated(categories such as Description by retrieving the Wikipedia abstract are justmarked as wrongnil given the reference input) shown in Table 10

MetricCategory Human Location NumericLevenshtein 5000 5385 484Overlap 4242 4103 484Jaccard index 758 000 000Dice 758 000 000Jaro 909 000 000Jaro-Winkler 909 000 000Needleman-Wunch 4242 4615 484LogSim 4242 4615 484Soundex 4242 4615 323Smith-Waterman 909 256 000JFKDistance 5455 5385 484

Table 10 Accuracy results for every similarity metric added for each categoryusing the test corpus of 200 questions and 32 passages from Google Bestprecision and accuracy of the system bolded except for when all the metricsgive the same result

We can easily notice that all of these metrics behave in similar ways througheach category ie results are always lsquoproporcionalrsquo in each category to howthey perform overall That stops us from picking best metrics for each categorysince we find out that the new distance implemented equals or surpasses (aswith the case of Human category) all the others

54 Posterior analysis

In another effort to improve the results above a deeper analysis was madethrough the systemrsquos output regarding the corpus test of 200 questions in orderto detect certain lsquoerror patternsrsquo that could be resolved in the future Below arethe main points of improvement

ndash Time-changing answers ndash Some questions do not have a straight answerone that never changes with time A clear example is a question suchas lsquoWho is the prime-minister of Japanrsquo We have observed that formerprime-minister Yukio Hatoyama actually earns a higher score than cur-rent prime-minister Naoto Kan which should be the correct answer Thesolution Give an extra weight to passages that are at least from the sameyear of the collected corpus from Google (in this case 2010)

ndash Numeric answers are easily failable due to the fact that when we expecta number it can pick up words such as lsquoonersquo actually retrieving that as

50

a top answer in a couple of cases Moreover when it expects an answerof type NumericDate it could still easely choose the dateyear whenthe article was posted (or even the number of views of that article) Thesolution might pass for a) ignoring lsquoonersquo to lsquothreersquo when written in wordsb) exclude the publishing date from the snippet somehow and c) improvethe regular expressions used to filter candidates

ndash Filtering stage does not seem to work in certain questions ndash in some ques-tions we found that one or more of the final answers are actually withinthe question under certain variations For instance for the question lsquoWhois John J Famalaro accused of having killedrsquo John Famalaro is the topanswer retrieved The solution here should be applying a set of rules toremove these variations from the final answers

ndash Answers in different formats than expected ndash some questions imply mul-tiple answers but we are only searching for a single one Here is a goodexample the correct answer for the question lsquoWhen did Innsbruck hostthe Winter Olympicsrsquo is actually two dates as it is stated in the refer-ence file but the system only goes to retrieve one of them leading to aincorrect answer Should we put the multiple answers separated as singlereferences Other questions such as lsquoWhen did the journalist CharlotteBass liversquo demand a time interval and yet a single year is retrieved Whatto do in this case

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, when possible.

– Incomplete references – the most easily detectable error to be corrected is in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After filling in such gaps, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
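
To make two of the proposed fixes concrete, here is a minimal, hypothetical sketch (not JustAsk's actual code) of the filters suggested above: dropping candidates that merely echo the question terms, and ignoring the spelled-out numbers 'one' to 'three' when a numeric answer is expected.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical candidate-answer filters for two of the error patterns listed above. */
public class CandidateFilters {

    private static final Set<String> SMALL_NUMBER_WORDS =
            new HashSet<>(Arrays.asList("one", "two", "three"));

    private static Set<String> tokens(String text) {
        // Lower-case, strip punctuation and split on whitespace.
        return new HashSet<>(Arrays.asList(
                text.toLowerCase().replaceAll("[^a-z0-9 ]", " ").trim().split("\\s+")));
    }

    /** Drops candidates whose tokens all occur in the question, e.g. 'John Famalaro'
     *  for 'Who is John J. Famalaro accused of having killed?'. */
    public static List<String> dropQuestionEchoes(String question, List<String> candidates) {
        Set<String> questionTokens = tokens(question);
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (!questionTokens.containsAll(tokens(candidate))) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    /** For Numeric questions, drops the spelled-out numbers 'one' to 'three',
     *  which were being wrongly extracted as top answers. */
    public static List<String> dropSmallNumberWords(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (!SMALL_NUMBER_WORDS.contains(candidate.trim().toLowerCase())) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}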


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved

– We performed a state-of-the-art survey on relations between answers and on paraphrase identification, in order to detect new possible ways to improve JustAsk's results with the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

ndash We developed new metrics to improve the clustering of answers

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

ndash We performed further analysis to detect persistent error patterns

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric in their own right.

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help to improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Languages 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.


[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.


[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.


[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering the following notation (a code sketch of these helper predicates follows the list):

• Ai,j – the j-th token of candidate answer Ai

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is the hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp
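
A minimal Java sketch of these helper predicates is given below, assuming an answer is represented by its array of tokens; lemma and ishypernym are omitted because they require a lexical resource such as WordNet. The class and method names are illustrative only.

import java.util.regex.Pattern;

/** Illustrative implementations of the string predicates used by the rules below. */
public final class RulePredicates {

    private RulePredicates() { }

    /** length(X): the number of tokens of an answer. */
    public static int length(String[] answer) {
        return answer.length;
    }

    /** contains(X, Y): true if X contains Y. */
    public static boolean contains(String x, String y) {
        return x.contains(y);
    }

    /** containsic(X, Y): true if X contains Y, case insensitive. */
    public static boolean containsIc(String x, String y) {
        return x.toLowerCase().contains(y.toLowerCase());
    }

    /** equals(X, Y): true if X is lexicographically equal to Y. */
    public static boolean equalsTo(String x, String y) {
        return x.equals(y);
    }

    /** equalsic(X, Y): true if X equals Y, case insensitive. */
    public static boolean equalsIc(String x, String y) {
        return x.equalsIgnoreCase(y);
    }

    /** matches(X, RegExp): true if X matches the regular expression. */
    public static boolean matches(String x, String regExp) {
        return Pattern.matches(regExp, x);
    }
}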

1.1 Questions of category Date

Ai includes Aj if (a code sketch of these rules follows the list):

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
– Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
– Example: Y2001 includes D01 M01 Y2001
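
In code, both rules reduce to a containment test plus a token-length difference of one or two. An illustrative sketch, not JustAsk's actual implementation:

/** Illustrative check for the Date inclusion rules above. */
public class DateRules {

    /** Ai includes Aj if Aj contains Ai and Aj is one or two tokens more specific,
     *  e.g. 'M01 Y2001' includes 'D01 M01 Y2001'. */
    public static boolean includes(String ai, String aj) {
        int lengthAi = ai.trim().split("\\s+").length;
        int lengthAj = aj.trim().split("\\s+").length;
        int extraTokens = lengthAj - lengthAi;
        return aj.contains(ai) && (extraTokens == 1 || extraTokens == 2);
    }
}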



1.2 Questions of category Numeric

Given

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if (a code sketch of the numeric checks of this subsection is given at the end of the subsection):

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
– Example: 200 people is equivalent to 200

• matchesic(Ai, "()\bnumber( )") and matchesic(Aj, "()\bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
– Example: 210 people is equivalent to 200 people

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )")
– Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAi = numAj)
– Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
– Example: approximately 200 people includes 210


• length(Aj) = 1 and matchesic(Aj, number) and matchesic(Ai, "()\bmorethan number( )") and (numAi < numAj)
– Example: more than 200 includes 300

• length(Aj) = 1 and matchesic(Aj, number) and matchesic(Ai, "()\blessthan number( )") and (numAi > numAj)
– Example: less than 300 includes 200
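
The heart of these Numeric rules is a tolerance comparison between the parsed numeric values. The following illustrative sketch (the names are ours, not JustAsk's) assumes the numbers have already been extracted from the candidate strings:

/** Illustrative checks for the Numeric rules above, with the 1.1/0.9 tolerance. */
public class NumericRules {

    private static final double MAX_FACTOR = 1.1; // 'max' above
    private static final double MIN_FACTOR = 0.9; // 'min' above

    /** Equivalence within the tolerance band, e.g. '210 people' vs '200 people'. */
    public static boolean equivalent(double numAi, double numAj) {
        return numAi * MAX_FACTOR > numAj && numAi * MIN_FACTOR < numAj;
    }

    /** 'more than numAi' includes numAj, e.g. 'more than 200' includes 300. */
    public static boolean moreThanIncludes(double numAi, double numAj) {
        return numAi < numAj;
    }

    /** 'less than numAi' includes numAj, e.g. 'less than 300' includes 200. */
    public static boolean lessThanIncludes(double numAi, double numAj) {
        return numAi > numAj;
    }
}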

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
– Example: John F Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
– Example: Mr Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
– Example: John Fitzgerald Kennedy is equivalent to John F Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
– Example: John F Kennedy is equivalent to F Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if (a code sketch combining these rules with those of Section 1.3 follows this list):

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
– Example: Mr Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
– Example:

• levenshteinDistance(Ai, Aj) < 1
– Example:
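
A few of the Human and Human:Individual rules above, written out as plain token comparisons; an illustrative sketch that assumes the answers are already tokenized into arrays:

/** Illustrative implementation of some of the person-name equivalence rules above. */
public class HumanNameRules {

    /** 'John F Kennedy' (3 tokens) is equivalent to 'John Kennedy' (2 tokens):
     *  first and last tokens match, the middle token is ignored. */
    public static boolean equivalentDroppingMiddle(String[] ai, String[] aj) {
        return ai.length == 3 && aj.length == 2
                && ai[0].equalsIgnoreCase(aj[0])
                && ai[2].equalsIgnoreCase(aj[1]);
    }

    /** 'John Fitzgerald Kennedy' is equivalent to 'John F Kennedy':
     *  outer tokens match and the middle tokens share the same initial. */
    public static boolean equivalentAbbreviatedMiddle(String[] ai, String[] aj) {
        return ai.length == 3 && aj.length == 3
                && ai[0].equalsIgnoreCase(aj[0])
                && ai[2].equalsIgnoreCase(aj[2])
                && Character.toLowerCase(ai[1].charAt(0)) == Character.toLowerCase(aj[1].charAt(0));
    }

    /** Human:Individual rule: 'Mr Kennedy' is equivalent to 'Kennedy'. */
    public static boolean equivalentTitleDropped(String[] ai, String[] aj) {
        return ai.length == 2 && aj.length == 1 && ai[1].equalsIgnoreCase(aj[0]);
    }
}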

1.4 Questions of category Location

Ai is equivalent to Aj if (a code sketch of some of this subsection's rules is given at the end of the subsection):

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
– Example: Republic of China is equivalent to China


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
– Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0")
– Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
– Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
– Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
– Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
– Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
– Example: California includes East California
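
Similarly, two of the Location rules can be written as the following illustrative sketch, assuming (as in the 'Miami, Florida' example) that the comma is kept as a separate token:

import java.util.regex.Pattern;

/** Illustrative checks for two of the Location rules above. */
public class LocationRules {

    private static final Pattern QUALIFIER =
            Pattern.compile("(?i)(republic|city|kingdom|island)");

    /** 'Republic of China' (3 tokens) is equivalent to 'China' (1 token). */
    public static boolean equivalentWithQualifier(String[] ai, String[] aj) {
        return ai.length == 3 && aj.length == 1
                && ai[2].equalsIgnoreCase(aj[0])
                && QUALIFIER.matcher(ai[0]).matches()
                && ai[1].matches("(?i)(of|in)");
    }

    /** 'Florida' includes 'Miami , Florida' (comma as its own token). */
    public static boolean includesCommaQualified(String[] ai, String[] aj) {
        return ai.length == 1 && aj.length == 3
                && ai[0].equalsIgnoreCase(aj[2])
                && aj[1].equals(",");
    }
}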

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
– Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
– Example: animal includes monkey


[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal

[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009

[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools

55

[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005

56


– 'Namibia has built a water canal measuring about 300 km long. Siyabona Africa Travel (Pty) Ltd, "Popa Falls – Okavango River – Botswana" webpage.'

As we can see, the second passage also includes the headword 'Okavango River', so we cannot simply boost the score of the passage that includes the question's headword.

Possible solution: we can try to give more weight to a candidate answer if that answer is closer to the keywords used in the search, by a certain predefined threshold.

2. Parts of person names/proper nouns – several candidates over categories such as Person and Location remain in different clusters when they clearly refer to the same entity. Here are a few examples:

'Queen Elizabeth' vs. 'Elizabeth II'

'Mt Kilimanjaro' vs. 'Mount Kilimanjaro'

'River Glomma' vs. 'Glomma'

'Consejo Regulador de Utiel-Requena' vs. 'Util-Requena' vs. 'Requena'

'Calvin Broadus' vs. 'Broadus'

Solution: we might extend a small set of hand-written rules to relate answers such as 'Queen Elizabeth' and 'Elizabeth II', or 'Bill Clinton' and 'Mr. Clinton', normalizing abbreviations such as 'Mr.', 'Mrs.', 'Queen', etc. Then, by applying a new metric that measures similarity at the token level, we might be able to cluster these answers. If there is at least one match between two tokens of the candidate answers, it is quite likely that the two mention the same entity, especially if both candidates share the same number of tokens and the initials of the remaining tokens match. To give a simple example, consider the answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy'. A simple Levenshtein distance will detect a considerable number of edit operations. Should there be a better metric for these particular cases? The LogSim distance takes into account the number of links, i.e. common tokens, between two answers, which should give it a slight edge over Levenshtein. It would, if we had more than only one common link, as we have here ('Kennedy'), out of three words for each answer. So we have LogSim approximately equal to 0.014, while Levenshtein actually gives us a better score (0.59). With the new proposed metric we detect the common token 'Kennedy' and then check the other tokens. In this case we have an easy match, with both answers sharing the same number of words and each word sharing at least the same initial, in the same order. What should the threshold be here, in case we do not have this perfect alignment? If, in another example, we have a single match between answers but the initials of the other tokens are different, should we reject it? The answer is likely to be 'yes'.


3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e. one location composes part of the name of the other), but instead of referring to the same entity these answers share a relation of inclusion. Examples:

'Mt Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (30). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issues – the same entity can appear under completely different names: we can find the name in the original language of the entity's location or in a normalized English form. Even short names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' and 'GROZNY'

Solution: the solution should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'U.S.' and 'US' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case, we may extend the abbreviation normalization to also cover this kind of acronym.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.


4 Relating Answers

Over this section we will try to solve the potential problems detected during the error analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships, valuing relationships of equivalence over inclusion, alternative and especially contradictory answers, which will possibly allow a better clustering of answers.

The goal here is to extend the already existing metrics to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy, which we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

After we have made this process automatic and have extended the answering module, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. In the context of aligning candidate answers that should share the same cluster, our first thought was that the equivalence relation should have a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than equivalences for a certain category.

The work in [26] has taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk that detects relations between answers and then uses them for a (potentially) better clustering of answers, by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating with the respective answers) for a better clustering:

– Frequency – i.e. the number of times the exact same answer appears, without any variation;

– Equivalence;

– Inclusion (divided into two types: one for the answer that includes and the other for the answer that is included by the other).

For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later, we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all across the corpora.
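To illustrate the scoring scheme just described, the following minimal sketch (in Java, the language of JustAsk) shows how such a boost could be computed. The class, enum and method names are hypothetical and merely mirror the description above; they are not the actual JustAsk implementation.

import java.util.ArrayList;
import java.util.List;

// Minimal, self-contained sketch (hypothetical names) of the relation-based score
// boosting described above: +1.0 per equivalence and +0.5 per inclusion found
// for a candidate answer; alternative/contradictory relations are not boosted.
public class RelationScoringSketch {

    enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY, ALTERNATIVE, CONTRADICTION }

    static double boostedScore(double baseScore, List<RelationType> relations) {
        double boost = 0.0;
        for (RelationType type : relations) {
            if (type == RelationType.EQUIVALENCE) {
                boost += 1.0;
            } else if (type == RelationType.INCLUDES || type == RelationType.INCLUDED_BY) {
                boost += 0.5;
            }
        }
        return baseScore + boost;
    }

    public static void main(String[] args) {
        List<RelationType> relations = new ArrayList<RelationType>();
        relations.add(RelationType.EQUIVALENCE);   // e.g. shares an equivalence with another candidate
        relations.add(RelationType.INCLUDED_BY);   // e.g. is included by a longer candidate
        System.out.println(boostedScore(2.0, relations)); // prints 3.5
    }
}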


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
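As an illustration of this kind of normalizer, here is a minimal sketch of such a converter; the class name, the regular expressions and the choice of target units are assumptions made for the example, not the actual JustAsk code (only the conversion factors themselves are standard).

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of a unit-conversion normalizer along the lines described above
// (kilometres to miles, kilograms to pounds, Celsius to Fahrenheit).
public class UnitNormalizerSketch {

    private static final Pattern KM = Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*(?:km|kilometers?|kilometres?)");
    private static final Pattern KG = Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*(?:kg|kilograms?)");
    private static final Pattern CELSIUS = Pattern.compile("(-?\\d+(?:\\.\\d+)?)\\s*(?:degrees? )?celsius");

    static String normalize(String answer) {
        String text = answer.toLowerCase();
        Matcher m = KM.matcher(text);
        if (m.find()) {
            return String.format("%.1f miles", Double.parseDouble(m.group(1)) * 0.621371);
        }
        m = KG.matcher(text);
        if (m.find()) {
            return String.format("%.1f pounds", Double.parseDouble(m.group(1)) * 2.20462);
        }
        m = CELSIUS.matcher(text);
        if (m.find()) {
            return String.format("%.1f fahrenheit", Double.parseDouble(m.group(1)) * 9.0 / 5.0 + 32.0);
        }
        return answer; // nothing to convert
    }

    public static void main(String[] args) {
        System.out.println(normalize("about 300 km"));       // 186.4 miles
        System.out.println(normalize("90 kilograms"));       // 198.4 pounds
        System.out.println(normalize("37 degrees Celsius")); // 98.6 fahrenheit
    }
}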

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it also only covers very particular cases for the moment, leaving room for further extensions.
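A sketch of this acronym normalization can be as simple as a lookup table; the class name and the (tiny) table below are merely illustrative, and in practice the table would have to be much larger or derived from a gazetteer.

import java.util.HashMap;
import java.util.Map;

// Tiny illustrative sketch of acronym expansion for candidate answers.
public class AcronymNormalizerSketch {

    private static final Map<String, String> EXPANSIONS = new HashMap<String, String>();
    static {
        EXPANSIONS.put("US", "United States");
        EXPANSIONS.put("U.S.", "United States");
        EXPANSIONS.put("USA", "United States of America");
        EXPANSIONS.put("UN", "United Nations");
    }

    static String normalize(String answer) {
        String key = answer.trim();
        return EXPANSIONS.containsKey(key) ? EXPANSIONS.get(key) : answer;
    }

    public static void main(String[] args) {
        System.out.println(normalize("US"));     // United States
        System.out.println(normalize("Lisbon")); // unchanged
    }
}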

4.2.2 Proximity Measure

As for the problem in point 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
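A minimal sketch of this proximity boost follows; it assumes that character positions are taken from the passage and that the first occurrence of each keyword is used, which are simplifications of the example (the threshold of 20 characters and the 5-point boost come from the text, everything else is illustrative).

import java.util.Arrays;
import java.util.List;

// Sketch: boost an answer's score by 5 points if its average character distance
// to the query keywords found in the passage is below 20 characters.
public class ProximityBoostSketch {

    private static final int DISTANCE_THRESHOLD = 20; // characters
    private static final double BOOST = 5.0;          // points

    static double proximityBoost(String passage, String answer, List<String> keywords) {
        int answerPos = passage.indexOf(answer);
        if (answerPos < 0 || keywords.isEmpty()) {
            return 0.0;
        }
        double totalDistance = 0.0;
        int found = 0;
        for (String keyword : keywords) {
            int keywordPos = passage.indexOf(keyword);
            if (keywordPos >= 0) {
                totalDistance += Math.abs(answerPos - keywordPos);
                found++;
            }
        }
        if (found == 0) {
            return 0.0;
        }
        double averageDistance = totalDistance / found;
        return averageDistance < DISTANCE_THRESHOLD ? BOOST : 0.0;
    }

    public static void main(String[] args) {
        String passage = "The current prime-minister of Japan is Naoto Kan";
        double boost = proximityBoost(passage, "Naoto Kan", Arrays.asList("prime-minister", "Japan"));
        System.out.println(boost); // 5.0: average keyword distance is 18 characters, below the threshold
    }
}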


4.3 Results

                    200 questions                 1210 questions
Before features     P: 50.6  R: 87.0  A: 44.0     P: 28.0  R: 79.5  A: 22.2
After features      P: 51.1  R: 87.0  A: 44.5     P: 28.2  R: 79.8  A: 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described over Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort we took the references gathered for the previous corpus and, for each pair of references, assigned one of the possible four types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and for those cases a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt about the relation type.

If the command given as input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations corpora in the command line.
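For illustration purposes only, a much simplified console loop in the spirit of this application could look like the sketch below; file parsing is omitted and all names are hypothetical, so this is not the tool itself.

import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Simplified sketch of the pair-annotation loop: for each pair of references of a
// question, read a judgment (E, I1, I2, A, C, D or skip) and write one output line.
public class RelationAnnotatorSketch {

    static void annotate(int questionId, List<String> references, PrintWriter out, Scanner in) {
        for (int i = 0; i < references.size(); i++) {
            for (int j = i + 1; j < references.size(); j++) {
                String judgment = "";
                while (!judgment.matches("E|I1|I2|A|C|D|skip")) {
                    System.out.printf("QUESTION %d: '%s' vs '%s' ? [E/I1/I2/A/C/D/skip]%n",
                            questionId, references.get(i), references.get(j));
                    judgment = in.nextLine().trim();
                }
                if (!judgment.equals("skip")) {
                    out.printf("QUESTION %d %s - %s AND %s%n",
                            questionId, label(judgment), references.get(i), references.get(j));
                }
            }
        }
    }

    static String label(String judgment) {
        if (judgment.equals("E")) return "EQUIVALENCE";
        if (judgment.startsWith("I")) return "INCLUSION";
        if (judgment.equals("A")) return "COMPLEMENTARY ANSWERS";
        if (judgment.equals("C")) return "CONTRADICTORY ANSWERS";
        return "IN DOUBT";
    }

    public static void main(String[] args) {
        PrintWriter out = new PrintWriter(System.out, true);
        annotate(5, Arrays.asList("Lee Myung-bak", "Lee"), out, new Scanner(System.in));
    }
}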

Over the next chapter we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of the two sets (A and B):

J(A, B) = \frac{|A \cap B|}{|A \cup B|}    (15)

In other words, the idea is to divide the number of elements the two sequences have in common by the total number of elements in their union.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of elements that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = \frac{2|A \cap B|}{|A| + |B|}    (16)

Dice's coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 x 5)/(6 + 6) = 10/12, roughly 0.83. If we were using Jaccard, we would have 5/7, roughly 0.71 (five common bigrams out of seven distinct ones). A short computational sketch of both bigram-based measures is given after this list.


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)    (17)

in which s_1 and s_2 are the two sequences to be tested, |s_1| and |s_2| are their lengths, m is the number of matching characters in s_1 and s_2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

d_w = d_j + l \cdot p \cdot (1 - d_j)    (18)

with d_j being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on a shared common prefix than the original Jaro distance. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 used by Levenshtein):

D(i, j) = \min \begin{cases} D(i-1, j-1) + G \\ D(i-1, j) + G \\ D(i, j-1) + G \end{cases}    (19)

The package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance is above the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at each sequence in its entirety, and chooses the alignment that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
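As referenced in the description of Dice's coefficient above, the following self-contained sketch computes both bigram-based measures for the 'Kennedy' vs. 'Keneddy' example; it is purely illustrative and not taken from the SimMetrics package.

import java.util.HashSet;
import java.util.Set;

// Illustrative computation of the Dice coefficient and Jaccard index over
// character-bigram sets.
public class BigramSimilaritySketch {

    static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<String>();
        for (int i = 0; i < s.length() - 1; i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    static double dice(String a, String b) {
        Set<String> bigramsA = bigrams(a);
        Set<String> bigramsB = bigrams(b);
        Set<String> common = new HashSet<String>(bigramsA);
        common.retainAll(bigramsB);
        return 2.0 * common.size() / (bigramsA.size() + bigramsB.size());
    }

    static double jaccard(String a, String b) {
        Set<String> bigramsA = bigrams(a);
        Set<String> bigramsB = bigrams(b);
        Set<String> common = new HashSet<String>(bigramsA);
        common.retainAll(bigramsB);
        Set<String> union = new HashSet<String>(bigramsA);
        union.addAll(bigramsB);
        return (double) common.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(dice("Kennedy", "Keneddy"));    // 0.8333... (10/12)
        System.out.println(jaccard("Kennedy", "Keneddy")); // 0.7142... (5/7)
    }
}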


Then the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much the threshold affects the performance of each metric.
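To make the clustering setup concrete, here is a rough sketch of single-link clustering with a pluggable string distance and a merge threshold; the Distance interface and the greedy merging loop are illustrative assumptions, not the actual JustAsk clusterer.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: single-link clustering of candidate answers. Two clusters are merged
// while the distance between their closest pair of members is below the threshold.
public class SingleLinkClusteringSketch {

    interface Distance {
        double between(String a, String b); // 0 = identical, 1 = completely different
    }

    static List<List<String>> cluster(List<String> answers, Distance distance, double threshold) {
        List<List<String>> clusters = new ArrayList<List<String>>();
        for (String answer : answers) {
            clusters.add(new ArrayList<String>(Arrays.asList(answer)));
        }
        boolean merged = true;
        while (merged) {
            merged = false;
            outer:
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    if (singleLink(clusters.get(i), clusters.get(j), distance) <= threshold) {
                        clusters.get(i).addAll(clusters.remove(j)); // merge cluster j into cluster i
                        merged = true;
                        break outer;
                    }
                }
            }
        }
        return clusters;
    }

    // Single link: the distance between two clusters is the distance of their closest pair.
    static double singleLink(List<String> c1, List<String> c2, Distance distance) {
        double best = Double.MAX_VALUE;
        for (String a : c1) {
            for (String b : c2) {
                best = Math.min(best, distance.between(a, b));
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Distance toy = new Distance() { // toy distance: 0 if equal ignoring case, else 1
            public double between(String a, String b) {
                return a.equalsIgnoreCase(b) ? 0.0 : 1.0;
            }
        };
        System.out.println(cluster(Arrays.asList("Grozny", "GROZNY", "Lisbon"), toy, 0.17));
        // [[Grozny, GROZNY], [Lisbon]]
    }
}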

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold   0.17                        0.33                        0.5
Levenshtein          P: 51.1  R: 87.0  A: 44.5   P: 43.7  R: 87.5  A: 38.5   P: 37.3  R: 84.5  A: 31.5
Overlap              P: 37.9  R: 87.0  A: 33.0   P: 33.0  R: 87.0  A: 33.0   P: 36.4  R: 86.5  A: 31.5
Jaccard index        P: 9.8   R: 56.0  A: 5.5    P: 9.9   R: 55.5  A: 5.5    P: 9.9   R: 55.5  A: 5.5
Dice Coefficient     P: 9.8   R: 56.0  A: 5.5    P: 9.9   R: 56.0  A: 5.5    P: 9.9   R: 55.5  A: 5.5
Jaro                 P: 11.0  R: 63.5  A: 7.0    P: 11.9  R: 63.0  A: 7.5    P: 9.8   R: 56.0  A: 5.5
Jaro-Winkler         P: 11.0  R: 63.5  A: 7.0    P: 11.9  R: 63.0  A: 7.5    P: 9.8   R: 56.0  A: 5.5
Needleman-Wunsch     P: 43.7  R: 87.0  A: 38.0   P: 43.7  R: 87.0  A: 38.0   P: 16.9  R: 59.0  A: 10.0
LogSim               P: 43.7  R: 87.0  A: 38.0   P: 43.7  R: 87.0  A: 38.0   P: 42.0  R: 87.0  A: 36.5
Soundex              P: 35.9  R: 83.5  A: 30.0   P: 35.9  R: 83.5  A: 30.0   P: 28.7  R: 82.0  A: 23.5
Smith-Waterman       P: 17.3  R: 63.5  A: 11.0   P: 13.4  R: 59.5  A: 8.0    P: 10.4  R: 57.5  A: 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are obtained with Levenshtein at threshold 0.17.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim, also using a tokenizer to break the strings into sets of tokens, taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and adding an extra weight for strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens in each answer. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, for that extra weight we divide by 2 what would otherwise be a value identical to a match between two tokens.

The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))}    (21)

and

firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2}    (22)

with a_1 and a_2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
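The following sketch re-implements this computation for illustration; the tokenization on whitespace, the case-insensitive matching and the interpretation of nonCommonTerms as the total number of tokens of both answers left unmatched after the perfect-match step are assumptions, since the original code may differ in these details.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative re-implementation of the JFKDistance idea from equations (20)-(22).
public class JFKDistanceSketch {

    static double distance(String answer1, String answer2) {
        List<String> tokens1 = new ArrayList<String>(Arrays.asList(answer1.toLowerCase().split("\\s+")));
        List<String> tokens2 = new ArrayList<String>(Arrays.asList(answer2.toLowerCase().split("\\s+")));
        int maxLength = Math.max(tokens1.size(), tokens2.size());

        // Perfect token matches ('common terms'): remove them from both token lists.
        int commonTerms = 0;
        for (int i = tokens1.size() - 1; i >= 0; i--) {
            if (tokens2.remove(tokens1.get(i))) {
                tokens1.remove(i);
                commonTerms++;
            }
        }

        // Tokens left unmatched after the perfect-match step.
        int nonCommonTerms = tokens1.size() + tokens2.size();

        // Among the remaining tokens, count pairs that share the same first letter.
        int firstLetterMatches = 0;
        for (String token : tokens1) {
            for (int j = 0; j < tokens2.size(); j++) {
                if (!token.isEmpty() && !tokens2.get(j).isEmpty()
                        && token.charAt(0) == tokens2.get(j).charAt(0)) {
                    tokens2.remove(j);   // each remaining token can only be matched once
                    firstLetterMatches++;
                    break;
                }
            }
        }

        double commonTermsScore = (double) commonTerms / maxLength;
        double firstLetterMatchesScore =
                nonCommonTerms == 0 ? 0.0 : (firstLetterMatches / (double) nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    public static void main(String[] args) {
        // 'J. F. Kennedy' vs 'John Fitzgerald Kennedy': 1 - 1/3 - (2/4)/2 = 0.4166...
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}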

A few flaws in this measure were identified later on, namely the scope of this measure within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures according to the type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                    200 questions                 1210 questions
Before              P: 51.1  R: 87.0  A: 44.5     P: 28.2  R: 79.8  A: 22.5
After new metric    P: 51.7  R: 87.0  A: 45.0     P: 29.2  R: 79.8  A: 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered the best results, we would have something like:

if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)

For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis we are looking for too much), Table 10 shows the results for every metric incorporated, for the three main categories that can be easily evaluated: Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).

Metric / Category    Human    Location  Numeric
Levenshtein          50.00    53.85     4.84
Overlap              42.42    41.03     4.84
Jaccard index        7.58     0.00      0.00
Dice                 7.58     0.00      0.00
Jaro                 9.09     0.00      0.00
Jaro-Winkler         9.09     0.00      0.00
Needleman-Wunsch     42.42    46.15     4.84
LogSim               42.42    46.15     4.84
Soundex              42.42    46.15     3.23
Smith-Waterman       9.09     2.56      0.00
JFKDistance          54.55    53.85     4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. The best accuracy per category is achieved by JFKDistance, except where all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e. results in each category are always 'proportional' to how they perform overall. That stops us from picking the best metric for each category, since we find that the newly implemented distance equals or surpasses all the others (as in the case of the Human category).

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than the current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).

– Numeric answers fail easily due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve a) ignoring 'one' to 'three' when written in words, b) excluding the publishing date from the snippet somehow, and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in formats different from the expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error lies in the reference file, which presents all the possible 'correct' answers but is potentially missing viable references.

After filling gaps such as this, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state-of-the-art survey on relating answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results within the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we should try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help to somewhat improve these results.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 (short papers), Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. (Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.)
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the j-th token of candidate answer Ai

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is a hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

11 Questions of category Date

Ai includes Aj if

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 1)ndash Example M01 Y2001 includes D01 M01 Y2001

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 2)ndash Example Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

12 Questions of category Numeric

Given

bull number = (d+([0-9E]))

bull lessthan = ((less than)|(almost)|(up to)|(nearly))

bull morethan = ((more than)|(over)|(at least))

bull approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

bull max = 11 (or another value)

bull min = 09 (or another value)

bull numX ndash the numeric value in X

bull numXmax = numX max

bull numXmin = numX min

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj number) andmatches(Ai lsquolsquoAj( [a-z]+)rsquorsquo)ndash Example 200 people is equivalent to 200

bull matchesic(Ai lsquolsquo()bnumber( )rsquorsquo) andmatchesic(Aj lsquolsquo()bnumber( )rsquorsquo) and(numAimax gt numAj) and (numAimin lt numAj)ndash Example 210 people is equivalent to 200 people

Ai includes Aj if

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo)ndash Example approximately 200 people includes 210

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAi = numAj)ndash Example approximately 200 people includes 200

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAjmax gt numAi) and (numAjmin lt numAi)ndash Example approximately 200 people includes 210

1 Rules for detecting relations between answers 3

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bmorethan number( )rsquorsquo) and(numAi lt numAj)ndash Example more than 200 includes 300

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()blessthan number( )rsquorsquo) and(numAi gt numAj)ndash Example less than 300 includes 200

13 Questions of category Human

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai0Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to John Kennedy

bull (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai1Aj1) andmatches(Ai0lsquolsquo(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)rsquorsquo)ndash Example Mr Kennedy is equivalent to John Kennedy

bull (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai0Aj0) and equalsic(Ai2Aj2)and equalsic(Ai10Aj10)ndash Example John Fitzgerald Kennedy is equivalent to John F Kennedy

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai1Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to F Kennedy

131 Questions of category HumanIndividual

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai1Aj0)ndash Example Mr Kennedy is equivalent to Kennedy

bull containsic(AiAj) ndash Example

bull levenshteinDistance(AiAj) lt 1 ndash Example

14 Questions of category Location

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai0lsquolsquo(republic|city|kingdom|island)rsquorsquo) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example Republic of China is equivalent to China

1 Rules for detecting relations between answers 4

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ailsquolsquosins+Aj0rsquorsquo)ndash Example in China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) andmatches(Ailsquolsquo(republic|city|kingdom|island)s+(in|of)s+Aj0rsquorsquo)ndash Example in the lovely Republic of China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) and matches(Ailsquolsquo(in|of)s+Aj0rsquorsquo)ndash Example

Ai includes Aj if

bull (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai0Aj2) andequalsic(Aj1ldquordquo) andndash Example Florida includes Miami Florida

bull (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai0Aj3) andequalsic(Aj2ldquordquo) andndash Example California includes Los Angeles California

bull (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai0Aj1) andmatchesic(Aj0lsquolsquo(south|west|north|east|central)[a-z]rsquorsquo) andndash Example California includes East California

15 Questions of category Entity

Ai is equivalent to Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and equalsic(lemma(Ai)lemma(Aj))ndash Example

Ai includes Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and ishypernym(AiAj)ndash Example animal includes monkey

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 36: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

3. Places/dates/numbers that differ in specificity – some candidates over the categories Location and Numeric that might refer to a correct answer remain in separate clusters. In many cases these are mistakes similar to the ones described in 2 (i.e., one location composes part of the name of the other), but instead of referring to the same entity these answers share a relation of inclusion. Examples:

'Mt. Kenya' vs. 'Mount Kenya National Park'

'Bavaria' vs. 'Overbayern Upper Bavaria'

'2009' vs. 'April 3, 2009'

Moreover, for the question 'Where was the director Michelangelo Antonioni born?', the answers 'Ferrara' and 'Italy' are found with equal scores (30). In this particular case we can consider either of these answers as correct, with 'Ferrara' included in Italy (being a more specific answer). But a better answer to extract would probably be 'Ferrara, Italy'. The questions here are mainly: "Where do we draw the limit for inclusive answers? Which answer should we consider as representative if we group these answers in a single cluster?"

Solution: based on [26], we decided to take into account relations of inclusion, and we are using the same rules that were used over the TREC corpus, now with JustAsk, hoping for similar improvements.

4. Language issue – the same entity often appears under completely different names: we can find the name in the original language of the entity's location, or in a normalized English form. And even small names such as 'Glomma' and 'Glama' remain in separate clusters.

Solution: gazetteer information can probably help here. Specific metrics such as Soundex can also make a positive impact, even if only in very particular cases such as the 'Glomma' vs. 'Glama' example.

5. Different measures/conversion issues – such as having answers in miles vs. km vs. feet, pounds vs. kilograms, etc.

Solution: create a normalizer that can act as a converter for the most important measures (distance, weight, etc.).

6. Misspellings/formatting issues – many names/entities appear either misspelled or in different formats (capital letters vs. lower case) over the web. While the Levenshtein distance may handle most of these cases for the minimum threshold tested, there are still a few unresolved ones, such as:

'Kotashaan' vs. 'Khotashan'

'Grozny' and 'GROZNY'

Solution: the answer should simply be normalizing all the answers into a standard format. As for the misspelling issue, a similarity measure such as Soundex can solve more specific cases such as the one described above. The question remains: when and where exactly do we apply it?

7. Acronyms – seen in a flagrant example with the answers 'US' and 'U.S.' left unclustered, but more generally in pairs such as 'United States of America' and 'USA'.

Solution: in this case we may extend the abbreviation normalization to also cover acronyms.

Over this section we have looked at several points of improvement. Next, we will describe how we addressed each of these points.

4 Relating Answers

Over this section we will try to solve the potential problems detected during the Error Analysis (see Section 3.4), using for that effort the implementation of the set of rules that will compose our 'relations module', new normalizers and a new proximity measure.

4.1 Improving the Answer Extraction module: Relations Module Implementation

Figure 4 represents what we want for our system: to detect the relations described in Section 2.1.

Figure 4: Proposed approach to automate the detection of relations between answers.

We receive several candidate answers to a question as input and then, through what we can call a 'Relationship Detection Module', relationships between pairs of answers are assigned. This module will contain the metrics already available (Levenshtein and word overlap) plus the new metrics taken from the ideas collected over the related work. We may also want to assign different weights to different relationships – valuing relationships of equivalence over inclusion, alternative and especially contradictory answers – which will possibly allow a better clustering of answers.

The goal here is to extend the metrics already existent to more efficient, semantically-based ones, using notions not only of synonymy but also of hyponymy and antonymy that we can access through WordNet. Moreover, we may want to analyze how much the context that surrounds an extracted answer can influence the overall result, taking into account what we saw in [2]. We also plan to use the set of rules provided in [35] to enhance the final answer returned to the user.

Once this process is automatic and we have extended the answering module, we can perform a study on how relationships between answers affect the outcome of the answer and, therefore, the efficiency of the Answer Extraction module.

As described in Section 2.1, we define four types of relations between answers: Equivalence, Inclusion, Alternative and Contradiction. For the context of aligning candidate answers that should share a same cluster, our first thought was that the equivalence relation had a much bigger weight than all the others. But inclusions should also matter, according to [26]. Moreover, they might even be more beneficial than those of equivalence for a certain category.

The work in [26] has taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers and, from then on, use that for a (potentially) better clustering of answers by boosting the score of candidates sharing certain relations. This would also solve point 3 described in Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation, the answer they are related to, and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency – i.e., the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes and the other for the answer that is included by the other).

For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and gave a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for any of the corpora.
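To make the weighting concrete, below is a minimal sketch – class and field names are our own assumptions, not JustAsk's actual code – of how a candidate's score can be boosted from the relations it holds, using the weights just described (1.0 for frequency and equivalence, 0.5 for each inclusion).

import java.util.List;

public class RelationScorerSketch {

    enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

    static class Relation {
        final RelationType type;
        Relation(RelationType type) { this.type = type; }
    }

    static class CandidateAnswer {
        final String text;
        final int frequency;            // exact repetitions of this answer
        final List<Relation> relations; // relations shared with other answers
        double score;
        CandidateAnswer(String text, int frequency, List<Relation> relations) {
            this.text = text; this.frequency = frequency; this.relations = relations;
        }
    }

    static final double FREQUENCY_WEIGHT = 1.0;
    static final double EQUIVALENCE_WEIGHT = 1.0;
    static final double INCLUSION_WEIGHT = 0.5;

    // Adds the relation-based boost described in the text to a candidate's score.
    static void boost(CandidateAnswer a) {
        double extra = a.frequency * FREQUENCY_WEIGHT;
        for (Relation r : a.relations) {
            extra += (r.type == RelationType.EQUIVALENCE) ? EQUIVALENCE_WEIGHT : INCLUSION_WEIGHT;
        }
        a.score += extra;
    }
}

Switching the experiment described above (0.5 for equivalence, 1.0 for inclusion in the Entity rules) amounts to swapping the two weight constants for that category.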


Figure 5: Relations module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
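As an illustration, such a conversion normalizer can be reduced to a table of multiplicative factors plus the affine Celsius-to-Fahrenheit case. The sketch below is a simplified assumption of such a class (unit names and the chosen canonical units are illustrative, not the actual JustAsk implementation).

public class UnitNormalizerSketch {

    // Converts a (value, unit) pair to the unit assumed here as canonical.
    static double toCanonical(double value, String unit) {
        switch (unit.toLowerCase()) {
            case "km": case "kilometres": case "kilometers": return value * 0.621371;       // -> miles
            case "m": case "meters": case "metres":          return value * 3.28084;        // -> feet
            case "kg": case "kilograms":                     return value * 2.20462;        // -> pounds
            case "c": case "celsius":                        return value * 9.0 / 5.0 + 32; // -> Fahrenheit
            default: return value; // already canonical or unknown unit
        }
    }

    public static void main(String[] args) {
        System.out.println(toCanonical(100, "km"));      // ~62.14 (miles)
        System.out.println(toCanonical(37, "celsius"));  // 98.6 (Fahrenheit)
    }
}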

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr', 'Mrs', 'Ms', 'King', 'Queen', 'Dame', 'Sir' and 'Dr'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr King' – in this case, however, we can easily limit 'King' removals to occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a telling reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, but it only covers very particular cases for the moment, leaving room for further extensions.
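A sketch of the acronym normalizer idea follows; the expansion table is deliberately tiny and illustrative, whereas a real module would need a proper per-category gazetteer of acronyms.

import java.util.HashMap;
import java.util.Map;

public class AcronymNormalizerSketch {

    private static final Map<String, String> EXPANSIONS = new HashMap<>();
    static {
        // Illustrative entries only; the real table would be much larger.
        EXPANSIONS.put("US", "United States");
        EXPANSIONS.put("USA", "United States of America");
        EXPANSIONS.put("UN", "United Nations");
    }

    // Expands the answer if it is a known acronym (ignoring periods), otherwise returns it unchanged.
    static String normalize(String answer) {
        String key = answer.replace(".", "").trim().toUpperCase();
        return EXPANSIONS.getOrDefault(key, answer);
    }
}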

4.2.2 Proximity Measure

As for the problem in 1, we decided to implement the solution of giving extra weight to answers that are closer to the query keywords appearing throughout the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
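The sketch below illustrates this proximity boost under the stated parameters (20-character threshold, 5-point boost). It simplifies matters by measuring the distance to the first occurrence of each keyword only, and the method names are assumptions rather than JustAsk's actual code.

import java.util.List;

public class ProximityBoostSketch {

    static final int DISTANCE_THRESHOLD = 20; // characters
    static final double BOOST = 5.0;          // score points

    // Returns the boost to add to an answer found at answerOffset in the passage.
    static double proximityBoost(String passage, int answerOffset, List<String> keywords) {
        double total = 0;
        int found = 0;
        for (String k : keywords) {
            int idx = passage.toLowerCase().indexOf(k.toLowerCase());
            if (idx >= 0) {
                total += Math.abs(idx - answerOffset);
                found++;
            }
        }
        if (found == 0) return 0;
        double averageDistance = total / found;
        return (averageDistance < DISTANCE_THRESHOLD) ? BOOST : 0;
    }
}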


4.3 Results

                    200 questions                 1210 questions
Before features     P: 50.6  R: 87.0  A: 44.0     P: 28.0  R: 79.5  A: 22.2
After features      P: 51.1  R: 87.0  A: 44.5     P: 28.2  R: 79.8  A: 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1 and 4.2, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear and, in that case, a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between the pair;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.

If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations' corpora in the command line.
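A minimal sketch of the annotation loop just described is given below; the file handling and method names are assumptions, but the command set (E, I1, I2, A, C, D, 'skip') and the output line format follow the description above (the direction of inclusion is omitted from the output label for brevity).

import java.io.PrintWriter;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class RelationAnnotatorSketch {

    static final Map<String, String> LABELS = Map.of(
            "E", "EQUIVALENCE", "I1", "INCLUSION", "I2", "INCLUSION",
            "A", "COMPLEMENTARY ANSWERS", "C", "CONTRADICTORY ANSWERS", "D", "IN DOUBT");

    // Asks for a judgment for every pair of references of one question and writes it out.
    static void annotate(int questionId, List<String> references, Scanner in, PrintWriter out) {
        for (int i = 0; i < references.size(); i++) {
            for (int j = i + 1; j < references.size(); j++) {
                String cmd;
                do {
                    System.out.printf("QUESTION %d: '%s' vs '%s'? [E/I1/I2/A/C/D/skip] ",
                            questionId, references.get(i), references.get(j));
                    cmd = in.nextLine().trim();
                } while (!cmd.equalsIgnoreCase("skip") && !LABELS.containsKey(cmd.toUpperCase()));
                if (!cmd.equalsIgnoreCase("skip")) {
                    out.printf("QUESTION %d %s - %s AND %s%n",
                            questionId, LABELS.get(cmd.toUpperCase()), references.get(i), references.get(j));
                }
            }
        }
    }
}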

Over the next chapter we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined by the size of the intersection divided by the size of the union of two sets (A and B):

J(A,B) = \frac{|A \cap B|}{|A \cup B|}    (15)

In other words, the idea is to divide the number of characters in common by the total number of characters of the two sequences.

– Dice's coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A,B) = \frac{2|A \cap B|}{|A| + |B|}    (16)

Dice's coefficient basically gives double the weight to the intersecting elements. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets has five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7 (five shared bigrams out of seven distinct ones). A small code check of this arithmetic appears right after this list of metrics.


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)    (17)

in which s_1 and s_2 are the two sequences to be tested, |s_1| and |s_2| are their lengths, m is the number of matching characters in s_1 and s_2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

d_w = d_j + l \cdot p \cdot (1 - d_j)    (18)

with d_j being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix compared to the original Jaro distance. In other words, it will give a much better score to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta', than Jaro.

– Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 used by Levenshtein):

D(i,j) = \min \begin{cases} D(i-1, j-1) + G \\ D(i-1, j) + G \\ D(i, j-1) + G \end{cases}    (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequences in their entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
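Returning to the Dice/Jaccard bigram example above, the following small, self-contained check (not part of JustAsk) reproduces the arithmetic for 'Kennedy' vs. 'Keneddy'.

import java.util.HashSet;
import java.util.Set;

public class BigramSimilarityCheck {

    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    public static void main(String[] args) {
        Set<String> a = bigrams("Kennedy");   // {Ke, en, nn, ne, ed, dy}
        Set<String> b = bigrams("Keneddy");   // {Ke, en, ne, ed, dd, dy}

        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);                   // {Ke, en, ne, ed, dy}
        Set<String> union = new HashSet<>(a);
        union.addAll(b);

        double dice = 2.0 * inter.size() / (a.size() + b.size()); // 10/12 ~ 0.83
        double jaccard = (double) inter.size() / union.size();    // 5/7  ~ 0.71
        System.out.println("Dice = " + dice + ", Jaccard = " + jaccard);
    }
}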


Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – in order to evaluate how much it affects the performance of each metric.
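For reference, the clustering step itself can be sketched as a greedy approximation of the single-link criterion, in which an answer joins the first cluster that already contains a member within the distance threshold. The class and method names below are assumptions; the actual JustAsk clusterer may differ in its details.

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class SingleLinkClustererSketch {

    // 'distance' is any of the metrics above, normalized to [0, 1]; 'threshold' is 0.17, 0.33 or 0.5.
    static List<List<String>> cluster(List<String> answers,
                                      BiFunction<String, String, Double> distance,
                                      double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String answer : answers) {
            List<String> target = null;
            for (List<String> c : clusters) {
                for (String member : c) {
                    if (distance.apply(answer, member) <= threshold) { target = c; break; }
                }
                if (target != null) break;
            }
            if (target == null) { target = new ArrayList<>(); clusters.add(target); }
            target.add(answer);
        }
        return clusters;
    }
}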

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold   | 0.17                 | 0.33                 | 0.5
Levenshtein          | P51.1 R87.0 A44.5    | P43.7 R87.5 A38.5    | P37.3 R84.5 A31.5
Overlap              | P37.9 R87.0 A33.0    | P33.0 R87.0 A33.0    | P36.4 R86.5 A31.5
Jaccard index        | P9.8  R56.0 A5.5     | P9.9  R55.5 A5.5     | P9.9  R55.5 A5.5
Dice Coefficient     | P9.8  R56.0 A5.5     | P9.9  R56.0 A5.5     | P9.9  R55.5 A5.5
Jaro                 | P11.0 R63.5 A7.0     | P11.9 R63.0 A7.5     | P9.8  R56.0 A5.5
Jaro-Winkler         | P11.0 R63.5 A7.0     | P11.9 R63.0 A7.5     | P9.8  R56.0 A5.5
Needleman-Wunsch     | P43.7 R87.0 A38.0    | P43.7 R87.0 A38.0    | P16.9 R59.0 A10.0
LogSim               | P43.7 R87.0 A38.0    | P43.7 R87.0 A38.0    | P42.0 R87.0 A36.5
Soundex              | P35.9 R83.5 A30.0    | P35.9 R83.5 A30.0    | P28.7 R82.0 A23.5
Smith-Waterman       | P17.3 R63.5 A11.0    | P13.4 R59.5 A8.0     | P10.4 R57.5 A6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Consider the two candidate answers 'J.F. Kennedy' and 'John Fitzgerald Kennedy'. Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it adds an extra weight when, besides at least one perfect match, there is another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two tokens unmatched on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))}    (21)

and

firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2}    (22)

with a_1 and a_2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.

A few flaws in this measure were identified later on, namely regarding its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.
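A minimal sketch of the metric as defined in equations (20)-(22) follows. Note the assumption made explicit in the comments: each pair of non-common tokens sharing a first letter is counted once in firstLetterMatches, which reproduces the 'J.F. Kennedy' vs. 'John Fitzgerald Kennedy' example with a distance of roughly 0.42.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JFKDistanceSketch {

    // Distance as in equations (20)-(22); first-letter pairs counted once (our assumption).
    static double distance(String a1, String a2) {
        List<String> t1 = new ArrayList<>(Arrays.asList(a1.split("\\s+")));
        List<String> t2 = new ArrayList<>(Arrays.asList(a2.split("\\s+")));
        int maxLength = Math.max(t1.size(), t2.size());

        // Exact (case-insensitive) common terms; matched tokens are consumed.
        int commonTerms = 0;
        List<String> rest1 = new ArrayList<>();
        for (String tok : t1) {
            int j = indexOfIgnoreCase(t2, tok);
            if (j >= 0) { t2.remove(j); commonTerms++; } else { rest1.add(tok); }
        }

        // Remaining tokens on both sides are the non-common terms.
        int nonCommonTerms = rest1.size() + t2.size();

        // Pairs of non-common terms sharing the same first letter.
        int firstLetterMatches = 0;
        for (String tok : rest1) {
            int j = indexOfFirstLetter(t2, tok);
            if (j >= 0) { t2.remove(j); firstLetterMatches++; }
        }

        double commonTermsScore = (double) commonTerms / maxLength;
        double firstLetterMatchesScore =
                nonCommonTerms == 0 ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    static int indexOfIgnoreCase(List<String> list, String tok) {
        for (int i = 0; i < list.size(); i++)
            if (list.get(i).equalsIgnoreCase(tok)) return i;
        return -1;
    }

    static int indexOfFirstLetter(List<String> list, String tok) {
        for (int i = 0; i < list.size(); i++)
            if (Character.toLowerCase(list.get(i).charAt(0)) == Character.toLowerCase(tok.charAt(0))) return i;
        return -1;
    }

    public static void main(String[] args) {
        // 1 - 1/3 - (2/4)/2 ~ 0.42
        System.out.println(distance("J.F. Kennedy", "John Fitzgerald Kennedy"));
    }
}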


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                    200 questions                 1210 questions
Before              P: 51.1  R: 87.0  A: 44.5     P: 28.2  R: 79.8  A: 22.5
After new metric    P: 51.7  R: 87.0  A: 45.0     P: 29.2  R: 79.8  A: 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered us the best results, we would have something like:

if (Category == Human) use Levenshtein
if (Category == Location) use LogSim
(...)

For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect much the analysis we are looking for), the results for every metric incorporated, broken down by the three main categories Human, Location and Numeric, which can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), are shown in Table 10.

Metric / Category    | Human   | Location | Numeric
Levenshtein          | 50.00   | 53.85    | 4.84
Overlap              | 42.42   | 41.03    | 4.84
Jaccard index        | 7.58    | 0.00     | 0.00
Dice Coefficient     | 7.58    | 0.00     | 0.00
Jaro                 | 9.09    | 0.00     | 0.00
Jaro-Winkler         | 9.09    | 0.00     | 0.00
Needleman-Wunsch     | 42.42   | 46.15    | 4.84
LogSim               | 42.42   | 46.15    | 4.84
Soundex              | 42.42   | 46.15    | 3.23
Smith-Waterman       | 9.09    | 2.56     | 0.00
JFKDistance          | 54.55   | 53.85    | 4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except where all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single, fixed answer, but one that changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are error-prone due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve a) ignoring 'one' to 'three' when written in words, b) excluding the publishing date from the snippet somehow, and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', 'John Famalaro' is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we keep the multiple answers separated, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily correctable error detected lies in the reference file that presents all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art review on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results through the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1% using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answers module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noémie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias, and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans, and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez, and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan, and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918) and 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur, and A. Mendes. Just ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent, and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the j-th token of candidate answer Ai

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is the hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
– Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
– Example: Y2001 includes D01 M01 Y2001


1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]*))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX * max

• numXmin = numX * min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
– Example: 200 people is equivalent to 200

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
– Example: 210 people is equivalent to 200 people

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
– Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
– Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
– Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
– Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
– Example: less than 300 includes 200

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
– Example: John F Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
– Example: Mr Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
– Example: John Fitzgerald Kennedy is equivalent to John F Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
– Example: John F Kennedy is equivalent to F Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
– Example: Mr Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
– Example:

• levenshteinDistance(Ai, Aj) < 1
– Example:

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
– Example: Republic of China is equivalent to China

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
– Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0")
– Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
– Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
– Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
– Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
– Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
– Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
– Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
– Example: animal includes monkey

Page 37: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

as Soundex can solve more specific cases such as the one described aboveThe question remains when and where do we apply it exactly

7 Acronyms ndash seen in a flagrant example with the answers lsquoUSrsquo and lsquoUSrsquounclustered but most generally pairs such as lsquoUnited States of Americarsquoand lsquoUSArsquo

Solution In this case we may extend abbreviations normalization to in-clude this abbreviation

Over this section we have looked at several points of improvement Next wewill describe how we solved all of these points

37

4 Relating Answers

Over this section we will try to solve the potential problems detected duringthe Error Analysis (see Section 34) using for that effort the implementation ofthe set of rules that will compose our lsquorelations modulersquo new normalizers anda new proximity measure

41 Improving the Answer Extraction moduleRelations Module Implementation

Figure 2 represents what we want for our system to detect the relationsdescribed in the Section 21

Figure 4 Proposed approach to automate the detection of relations betweenanswers

We receive several candidate answers of a question as input and then throughwhat we can call a lsquoRelationship Detection Modulersquo relationships between pairsof answers are assigned This module will contain the metrics already available(Levenshtein and word overlap) plus the new metrics taken from the ideas col-lected over the related work We also may want to assign different weights todifferent relationships to value relationships of equivalence over inclusion al-ternative and specially contradictory answers which will possibly allow a betterclustering of answers

The goal here is to extend the metrics already existent to more efficientsemantically-based ones using notions not only of synonymy but also hiponymyand antonymy that we can access through WordNet Moreover we may want to

38

analyze how much can the context that surronds an extracted answer influencethe overall result taking into account what we saw in [2] We also plan to usethe set of rules provided in [35] to enhance the final answer returned to the user

After we have this process automatic and we have extended the answeringmodule we can perform a study on how relationships between answers canaffect the outcome of the answer and therefore the efficiency of the AnswerExtraction module

As it was described in Section 21 we define four types of relations betweenanswers Equivalence Inclusion Alternative and Contradiction For the con-text of aligning candidate answers that should share a same cluster our firstthought was that the equivalence relation had a much bigger weight than allthe others But inclusions should also matter according to [26] Moreover theymight even be more benefic than those of equivalence for a certain category

The work in [26] has taught us that using relations between answers can considerably improve our results. In this step we incorporated the rules used in the context of that paper into the JustAsk system and created what we proposed to do during the first part of this thesis – a module in JustAsk to detect relations between answers and, from then on, use that for a (potentially) better clustering of answers by boosting the score of candidates sharing certain relations. This would also solve point 3 described over Section 3.4 ('Places/dates/numbers that differ in specificity').

These rules cover the following categories: Entity, Numeric, Location and Human:Individual. They can be seen as a set of clauses determining which relation we should assign to a pair of candidate answers.

The answers now possess a set of relations they share with other answers – namely the type of relation and the answer they are related to – and a score for that relation. The module itself is composed of an interface (Scorer) which supports all the rules for the three types of relations we are interested in counting (and associating to the respective answers) for a better clustering:

– Frequency – i.e. the number of times the exact same answer appears without any variation;

– Equivalence;

– Inclusion (divided in two types: one for the answer that includes and the other for the answer that is included by the other).

For a first experiment we decided to give a score of 0.5 to each relation of inclusion found for each answer, and 1.0 to frequency and equivalence. Later we decided to test the assessment made in [26] that relations of equivalence are actually more 'penalizing' than those of inclusion for the category Entity, and to give a score of 0.5 to relations of equivalence and 1.0 to inclusions for the rules within that category. Results were not different at all for all the corpora.
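As a rough illustration of how such a scoring scheme can be wired together, consider the sketch below (a minimal sketch in Java; the class and method names, such as RulesScorer, and the weights passed to it are illustrative assumptions, not the actual JustAsk code).

import java.util.*;

// Minimal sketch of relation-aware scoring of candidate answers (names are illustrative).
enum RelationType { EQUIVALENCE, INCLUDES, INCLUDED_BY }

class Relation {
    final RelationType type;
    Relation(RelationType type) { this.type = type; }
}

class CandidateAnswer {
    final String text;
    final int frequency;                              // exact repetitions of this answer
    final List<Relation> relations = new ArrayList<>();
    double score;
    CandidateAnswer(String text, int frequency) { this.text = text; this.frequency = frequency; }
}

interface Scorer {
    void score(CandidateAnswer answer);
}

class BaselineScorer implements Scorer {
    // Frequency counts with weight 1.0 per occurrence.
    public void score(CandidateAnswer a) { a.score += 1.0 * a.frequency; }
}

class RulesScorer implements Scorer {
    private final double equivalenceWeight, inclusionWeight;
    RulesScorer(double equivalenceWeight, double inclusionWeight) {
        this.equivalenceWeight = equivalenceWeight;
        this.inclusionWeight = inclusionWeight;
    }
    public void score(CandidateAnswer a) {
        for (Relation r : a.relations)
            a.score += (r.type == RelationType.EQUIVALENCE) ? equivalenceWeight : inclusionWeight;
    }
}

public class ScoringDemo {
    public static void main(String[] args) {
        CandidateAnswer a = new CandidateAnswer("John F. Kennedy", 3);
        a.relations.add(new Relation(RelationType.EQUIVALENCE));
        a.relations.add(new Relation(RelationType.INCLUDED_BY));
        List<Scorer> scorers = Arrays.asList(new BaselineScorer(), new RulesScorer(1.0, 0.5));
        for (Scorer s : scorers) s.score(a);
        System.out.println(a.text + " -> " + a.score);   // 3.0 + 1.0 + 0.5 = 4.5
    }
}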


Figure 5: Relations' module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: Baseline Scorer, to count the frequency, and Rules Scorer, to count relations of equivalence and inclusion.


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
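A minimal sketch of what such a conversion normalizer could look like (the conversions follow the description above; the class name, method names and category labels are illustrative assumptions, not the actual JustAsk implementation):

// Illustrative unit-conversion normalizer: maps a metric value to the corresponding
// imperial/Fahrenheit value so that equivalent answers in different units can cluster together.
public class UnitConversionNormalizer {

    public static double kilometresToMiles(double km)  { return km * 0.621371; }
    public static double metersToFeet(double m)        { return m * 3.28084; }
    public static double kilogramsToPounds(double kg)  { return kg * 2.20462; }
    public static double celsiusToFahrenheit(double c) { return c * 9.0 / 5.0 + 32.0; }

    // The category labels below are assumptions about how answer types are tagged.
    public static double normalize(String category, double value) {
        switch (category) {
            case "NUMERIC:DISTANCE":    return kilometresToMiles(value);
            case "NUMERIC:SIZE":        return metersToFeet(value);
            case "NUMERIC:WEIGHT":      return kilogramsToPounds(value);
            case "NUMERIC:TEMPERATURE": return celsiusToFahrenheit(value);
            default:                    return value;   // other categories are left untouched
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("NUMERIC:TEMPERATURE", 100.0)); // 212.0
    }
}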

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worthy of noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us yet another example we would probably be missing: 'Mr. King' – in this case, however, we can easily limit King removals only to those occurrences of the word 'King' right at the beginning of the answer. We can also add exceptions for the cases above. Still, it remains a curious statement that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, to treat 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations' – for categories such as Location:Country and Human:Group. As with the case of the normalizers above, it did not produce any noticeable changes in the system results, but it also only evaluates very particular cases for the moment, allowing room for further extensions.
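A sketch of the acronym normalizer idea (the mapping entries mirror the examples in the text; the class and its API are illustrative assumptions):

import java.util.Map;

// Illustrative acronym normalizer for categories such as Location:Country and Human:Group.
public class AcronymNormalizer {
    private static final Map<String, String> EXPANSIONS = Map.of(
            "US", "United States",
            "USA", "United States of America",
            "UN", "United Nations");

    public static String normalize(String answer) {
        return EXPANSIONS.getOrDefault(answer.trim(), answer);
    }

    public static void main(String[] args) {
        System.out.println(normalize("UN"));   // United Nations
    }
}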

4.2.2 Proximity Measure

As with the problem in point 1, we decided to implement the solution of giving extra weight to answers that remain closer to the query keywords that appear through the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as being 20 characters long), we boost that answer's score (by 5 points).
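A minimal sketch of this proximity boost, assuming character offsets are available for the answer and for each query keyword occurrence in the passage (the 20-character threshold and the 5-point boost follow the text; everything else is an illustrative assumption):

import java.util.List;

// Illustrative proximity measure: boost answers whose average character distance
// to the query keywords in the passage is below a threshold.
public class ProximityBooster {
    private static final int DISTANCE_THRESHOLD = 20;  // characters, as in the text
    private static final double BOOST = 5.0;           // score points, as in the text

    public static double boostedScore(double currentScore,
                                      int answerOffset,
                                      List<Integer> keywordOffsets) {
        if (keywordOffsets.isEmpty()) return currentScore;
        double sum = 0;
        for (int offset : keywordOffsets) sum += Math.abs(answerOffset - offset);
        double averageDistance = sum / keywordOffsets.size();
        return averageDistance < DISTANCE_THRESHOLD ? currentScore + BOOST : currentScore;
    }

    public static void main(String[] args) {
        // Answer found at offset 100; keywords at offsets 90 and 115 -> average distance 12.5.
        System.out.println(boostedScore(10.0, 100, List.of(90, 115)));  // 15.0
    }
}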


4.3 Results

                    200 questions               1210 questions
Before features     P 50.6  R 87.0  A 44.0      P 28.0  R 79.5  A 22.2
After features      P 51.1  R 87.0  A 44.5      P 28.2  R 79.8  A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described over Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references we gathered in the previous corpus and, for each pair of references, assigned one of the possible four types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, and in that case a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As it happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between pairs;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;


– 'D' for cases of doubt between relation types.

If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file too. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADITORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations' corpora in the command line.
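A compact sketch of such an annotation loop is given below (assuming one question per line with tab-separated references; the file names, prompts and output format are illustrative assumptions rather than the actual tool):

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Illustrative command-line annotation loop for pairs of references.
public class RelationAnnotator {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Path.of("questions_references.txt"));
        Scanner in = new Scanner(System.in);
        try (PrintWriter out = new PrintWriter("relations.txt")) {
            for (int q = 0; q < lines.size(); q++) {
                String[] refs = lines.get(q).split("\t");
                if (refs.length < 2) continue;              // single reference: advance to next question
                for (int i = 0; i < refs.length; i++)
                    for (int j = i + 1; j < refs.length; j++) {
                        String judgment;
                        do {
                            System.out.printf("Q%d: '%s' vs '%s' [E/I1/I2/A/C/D/skip]: ",
                                              q + 1, refs[i], refs[j]);
                            judgment = in.nextLine().trim().toUpperCase();
                        } while (!judgment.matches("E|I1|I2|A|C|D|SKIP"));
                        if (!judgment.equals("SKIP"))
                            out.printf("QUESTION %d: %s - %s AND %s%n", q + 1, judgment, refs[i], refs[j]);
                    }
            }
        }
    }
}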

Over the next chapter we will try to improve the results, working directly with several possible distance measures (and combinations of two or more of these) we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics to potentially improve our system, as we also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics':

– Jaccard Index [11] – the Jaccard index (also known as Jaccard similarity coefficient) is defined by the size of the intersection divided by the size of the union of two sets (A and B):

J(A,B) = |A ∩ B| / |A ∪ B|  (15)

In other words, the idea will be to divide the number of characters in common by the total number of characters of the two sequences.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A,B) = 2|A ∩ B| / (|A| + |B|)  (16)

Dice's Coefficient will basically give double the weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams for each of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us exactly four elements: {Ke, en, ne, dy}. The result would be (2 · 4)/(6 + 6) = 8/12 = 2/3. If we were using Jaccard, we would have 4/8 = 1/2.


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated by the following:

Jaro(s1, s2) = 1/3 · (m/|s1| + m/|s2| + (m − t)/m)  (17)

in which s1 and s2 are the two sequences to be tested, |s1| and |s2| are their lengths, m is the number of matching characters in s1 and s2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – is an extension of the Jaro metric. This distance is given by:

dw = dj + l · p · (1 − dj)  (18)

with dj being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix compared to the original Jaro distance. In other words, it will give a much better score to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta', than Jaro.

– Needleman-Wunsch [30] – dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 for Levenshtein):

D(i, j) = min { D(i−1, j−1) + G,  D(i−1, j) + G,  D(i, j−1) + G }  (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the threshold proposed (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – is a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers with common affixes, such as 'acting' and 'enact'.


Then, the LogSim metric by Cordeiro and his team, described over the Related Work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – of the single-link clustering algorithm, in order to evaluate how much it affects the performance of each metric.
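To make the experimental setup concrete, the sketch below shows single-link agglomerative clustering with a pluggable distance function and threshold, which is the role the tested metrics and threshold values play here (a simplified illustration; the real JustAsk clusterer and its class names may differ):

import java.util.*;
import java.util.function.BiFunction;

// Simplified single-link clustering: two clusters are merged while the smallest
// pairwise distance between their members is below the threshold.
public class SingleLinkClusterer {
    public static List<List<String>> cluster(List<String> answers,
                                             BiFunction<String, String, Double> distance,
                                             double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String a : answers) clusters.add(new ArrayList<>(List.of(a)));
        boolean merged = true;
        while (merged) {
            merged = false;
            outer:
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++)
                    if (singleLink(clusters.get(i), clusters.get(j), distance) < threshold) {
                        clusters.get(i).addAll(clusters.remove(j));
                        merged = true;
                        break outer;
                    }
        }
        return clusters;
    }

    private static double singleLink(List<String> c1, List<String> c2,
                                     BiFunction<String, String, Double> distance) {
        double min = Double.MAX_VALUE;
        for (String a : c1) for (String b : c2) min = Math.min(min, distance.apply(a, b));
        return min;
    }

    public static void main(String[] args) {
        // Toy normalized distance: 0 for equal strings (case-insensitive), 1 otherwise.
        BiFunction<String, String, Double> d = (a, b) -> a.equalsIgnoreCase(b) ? 0.0 : 1.0;
        System.out.println(cluster(List.of("Grozny", "GROZNY", "Moscow"), d, 0.17));
        // [[Grozny, GROZNY], [Moscow]]
    }
}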

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides us the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually inexistent in most cases.


Metric / Threshold    0.17                       0.33                       0.5
Levenshtein           P 51.1  R 87.0  A 44.5     P 43.7  R 87.5  A 38.5     P 37.3  R 84.5  A 31.5
Overlap               P 37.9  R 87.0  A 33.0     P 33.0  R 87.0  A 33.0     P 36.4  R 86.5  A 31.5
Jaccard index         P  9.8  R 56.0  A  5.5     P  9.9  R 55.5  A  5.5     P  9.9  R 55.5  A  5.5
Dice Coefficient      P  9.8  R 56.0  A  5.5     P  9.9  R 56.0  A  5.5     P  9.9  R 55.5  A  5.5
Jaro                  P 11.0  R 63.5  A  7.0     P 11.9  R 63.0  A  7.5     P  9.8  R 56.0  A  5.5
Jaro-Winkler          P 11.0  R 63.5  A  7.0     P 11.9  R 63.0  A  7.5     P  9.8  R 56.0  A  5.5
Needleman-Wunsch      P 43.7  R 87.0  A 38.0     P 43.7  R 87.0  A 38.0     P 16.9  R 59.0  A 10.0
LogSim                P 43.7  R 87.0  A 38.0     P 43.7  R 87.0  A 38.0     P 42.0  R 87.0  A 36.5
Soundex               P 35.9  R 83.5  A 30.0     P 35.9  R 83.5  A 30.0     P 28.7  R 82.0  A 23.5
Smith-Waterman        P 17.3  R 63.5  A 11.0     P 13.4  R 59.5  A  8.0     P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as Overlap Distance and LogSim, also using a tokenizer to break the strings into sets of tokens, then taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and also adding an extra weight for strings that not only share at least a perfect match but also have another token that shares the first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two tokens unmatched. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers. Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would be a value identical to a match between two tokens, for that extra weight.

The new distance is computed the following way:

distance = 1 − commonTermsScore − firstLetterMatchesScore  (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = commonTerms / max(length(a1), length(a2))  (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2  (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
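A minimal sketch of this metric, following equations (20)-(22) (assuming whitespace tokenization over already-normalized answers; it is an illustration of the idea, not the actual JustAsk class):

import java.util.*;

// Illustrative implementation of the JFKDistance idea from equations (20)-(22).
public class JFKDistance {
    public static double distance(String answer1, String answer2) {
        List<String> t1 = new ArrayList<>(Arrays.asList(answer1.split("\\s+")));
        List<String> t2 = new ArrayList<>(Arrays.asList(answer2.split("\\s+")));
        int maxLength = Math.max(t1.size(), t2.size());

        // Count common terms (perfect matches) and remove them from both token lists.
        int commonTerms = 0;
        for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
            String token = it.next();
            if (t2.remove(token)) { it.remove(); commonTerms++; }
        }

        // Among the unmatched tokens, count pairs sharing the same first letter.
        int nonCommonTerms = t1.size() + t2.size();
        int firstLetterMatches = 0;
        for (String a : t1)
            for (String b : t2)
                if (!a.isEmpty() && !b.isEmpty()
                        && Character.toLowerCase(a.charAt(0)) == Character.toLowerCase(b.charAt(0)))
                    firstLetterMatches++;

        double commonTermsScore = (double) commonTerms / maxLength;
        double firstLetterMatchesScore =
                nonCommonTerms == 0 ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    public static void main(String[] args) {
        // 'Kennedy' matches perfectly; 'J.'/'John' and 'F.'/'Fitzgerald' share first letters.
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}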

A few flaws of this measure were thought of later on, namely the scope of this measure within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows us the results after the implementation of this new distance metric.

                    200 questions               1210 questions
Before              P 51.1  R 87.0  A 44.5      P 28.2  R 79.8  A 22.5
After new metric    P 51.7  R 87.0  A 45.0      P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a combination of metrics) for a certain category (or a group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine, retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect that much the analysis we are looking for), here are the results for every metric incorporated, over the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric / Category     Human    Location   Numeric
Levenshtein           50.00    53.85       4.84
Overlap               42.42    41.03       4.84
Jaccard index          7.58     0.00       0.00
Dice                   7.58     0.00       0.00
Jaro                   9.09     0.00       0.00
Jaro-Winkler           9.09     0.00       0.00
Needleman-Wunsch      42.42    46.15       4.84
LogSim                42.42    46.15       4.84
Soundex               42.42    46.15       3.23
Smith-Waterman         9.09     2.56       0.00
JFKDistance           54.55    53.85       4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways through each category, i.e. results in each category are always 'proportional' to how they perform overall. That stops us from picking the best metric for each category, since we find out that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made through the system's output regarding the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are easily failable due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it could still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– Filtering stage does not seem to work in certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only goes on to retrieve one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is in the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state-of-the-art survey on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results with the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we should try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help to somewhat improve these results.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is the hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) − length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001


1.2 Questions of category Numeric

Given

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  – Example: less than 300 includes 200
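As an illustration, the numeric equivalence and 'approximately N' inclusion checks above could be coded roughly as follows (the 1.1/0.9 tolerance factors follow the definitions above; the regular expressions are simplified and the class itself is an illustrative assumption, not the actual JustAsk rule engine):

import java.util.regex.*;

// Illustrative check for equivalence/inclusion between two numeric answers.
public class NumericRules {
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:\\.\\d+)?");
    private static final Pattern APPROX =
            Pattern.compile("(?i).*\\b(approx[a-z]*|around|roughly|about|some)\\b.*");
    private static final double MAX = 1.1, MIN = 0.9;

    static Double numericValue(String answer) {
        Matcher m = NUMBER.matcher(answer.replace(",", ""));
        return m.find() ? Double.valueOf(m.group()) : null;
    }

    // Equivalence: both answers carry a number and the values are within +/-10% of each other.
    static boolean equivalent(String a1, String a2) {
        Double n1 = numericValue(a1), n2 = numericValue(a2);
        return n1 != null && n2 != null && n1 * MAX > n2 && n1 * MIN < n2;
    }

    // Inclusion: an 'approximately N' answer includes a bare number close to N.
    static boolean includes(String a1, String a2) {
        Double n1 = numericValue(a1), n2 = numericValue(a2);
        return n1 != null && n2 != null && APPROX.matcher(a1).matches()
                && n1 * MAX > n2 && n1 * MIN < n2;
    }

    public static void main(String[] args) {
        System.out.println(equivalent("210 people", "200 people"));         // true
        System.out.println(includes("approximately 200 people", "210"));    // true
    }
}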

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr. Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to F. Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr. Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)

• levenshteinDistance(Ai, Aj) < 1

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 38: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

4 Relating Answers

Over this section we will try to solve the potential problems detected duringthe Error Analysis (see Section 34) using for that effort the implementation ofthe set of rules that will compose our lsquorelations modulersquo new normalizers anda new proximity measure

41 Improving the Answer Extraction moduleRelations Module Implementation

Figure 2 represents what we want for our system to detect the relationsdescribed in the Section 21

Figure 4 Proposed approach to automate the detection of relations betweenanswers

We receive several candidate answers of a question as input and then throughwhat we can call a lsquoRelationship Detection Modulersquo relationships between pairsof answers are assigned This module will contain the metrics already available(Levenshtein and word overlap) plus the new metrics taken from the ideas col-lected over the related work We also may want to assign different weights todifferent relationships to value relationships of equivalence over inclusion al-ternative and specially contradictory answers which will possibly allow a betterclustering of answers

The goal here is to extend the metrics already existent to more efficientsemantically-based ones using notions not only of synonymy but also hiponymyand antonymy that we can access through WordNet Moreover we may want to

38

analyze how much can the context that surronds an extracted answer influencethe overall result taking into account what we saw in [2] We also plan to usethe set of rules provided in [35] to enhance the final answer returned to the user

After we have this process automatic and we have extended the answeringmodule we can perform a study on how relationships between answers canaffect the outcome of the answer and therefore the efficiency of the AnswerExtraction module

As it was described in Section 21 we define four types of relations betweenanswers Equivalence Inclusion Alternative and Contradiction For the con-text of aligning candidate answers that should share a same cluster our firstthought was that the equivalence relation had a much bigger weight than allthe others But inclusions should also matter according to [26] Moreover theymight even be more benefic than those of equivalence for a certain category

The work in [26] has taught us that using relations between answers canimprove considerably our results In this step we incorporated the rules used inthe context of that paper to the JustAsk system and created what we proposedto do during the first part of this thesis ndash a module in JustAsk to detect relationsbetween answers and from then use that for a (potentially) better clusteringof answers by boosting the score of candidates sharing certain relations Thiswould also solve point 3 described over section 34 (lsquoPlacesdatesnumbers thatdiffer in specificityrsquo)

These rules cover the following categories Entity Numeric Locationand Human Individual They can be seen as a set of clauses determiningwhich relation we should assign to a pair of candidate answers

The answers now possess a set of relations they share with other answersndash namely the type of relation and the answer they are related to and a scorefor that relation The module itself is composed of an interface (Scorer) whichsupports all the rules for the three types of relations we are interested in counting(and associating to the respective answers) for a better clustering

ndash Frequency ndash ie the number of times the exact same answer appearswithout any variation

ndash Equivalence

ndash Inclusion (divided in two types one for the answer that includes and theother for the answer that is included by the other)

For a first experiment we decided to give a score of 05 to each relation ofinclusion found for each answer and 10 to frequency and equivalence Laterwe decided to test the assessment made in [26] that relations of equivalence areactually more lsquopenalizingrsquo than those of inclusion for the category Entity andgive a score of 05 to relations of equivalence and 10 to inclusions for the ruleswithin that category Results were not different at all for all the corpora

39

Figure 5 Relationsrsquo module basic architecture Each answer has now a fieldrepresenting the set of relations it shares with other answers The interfaceScorer made to boost the score of the answers sharing a particular relation hastwo components Baseline Scorer to count the frequency and Rules Scorer tocount relations of equivalence and inclusion

40

42 Additional Features

421 Normalizers

To solve the conversion abbreviation and acronym issues described in points2 5 and 7 of Section 34 a few normalizers were developed For the con-version aspect a specific normalizer class was created dealing with answersof the type NumericDistance NumericSize NumericWeight and Nu-merictemperature that is able to convert kilometres to miles meters to feetkilograms to pounds and Celsius degrees to Fahrenheit respectively

Another normalizer was created to remove several common abbreviaturesfrom the candidate answers hoping that (with the help of new measures) wecould improve clustering too The abbreviatures removed were lsquoMrrsquo lsquoMrsrsquorsquo Mslsquo lsquoKingrsquo lsquoQueenrsquo lsquoDamersquo lsquoSirrsquo and lsquoDrrsquo Over here we find a partic-ular issue worthy of noting we could be removing something we should notbe removing Consider the possible answers lsquoMr Trsquo or lsquoQueenrsquo (the band)Curiously an extended corpus of 1134 questions provided us even other exam-ple we would probably be missing lsquoMr Kingrsquo ndash in this case however we caneasily limit King removals only to those occurences of the word lsquoKingrsquo right atthe beggining of the answer We can also add exceptions for the cases aboveStill it remains a curious statement that we must not underestimate under anycircumstances the richness of a natural language

A normalizer was also made to solve the formatting issue of point 6 to treatlsquoGroznyrsquo and lsquoGROZNYrsquo as equal answers

For the point 7 a generic acronym normalizer was used converting acronymssuch as lsquoUSrsquo to lsquoUnited Statesrsquo or lsquoUNrsquo to lsquoUnited Nationsrsquo ndash for categoriessuch as LocationCountry and HumanGroup As with the case of the nor-malizers above it did not produce any noticeable changes in the system resultsbut it also evaluates very particular cases for the moment allowing room forfurther extensions

422 Proximity Measure

As with the problem in 1 we decided to implement the solution of giving extraweight to answers that remain closer to the query keywords that appear throughthe passages For that we calculate the average distance from the answer tothe several keywords and if that average distance is under a certain threshold(we defined it as being 20 characters long) we boost that answerrsquos score (by 5points)

41

43 Results

200 questions 1210 questionsBefore features P506 R870

A440P280 R795A222

After features A511 R870A445

P282 R798A225

Table 7 Precision (P) Recall (R)and Accuracy (A) Results before and after theimplementation of the features described in 41 42 and 43 Using 30 passagesfrom New York Times for the corpus of 1210 questions from TREC

44 Results Analysis

These preliminary results remain underwhelming taking into account theeffort and hope that was put into the relations module (Section 41) in particular

The variation in results is minimal ndash only one to three correct answersdepending on the corpus test (Multisix or TREC)

45 Building an Evaluation Corpus

To evaluate the lsquorelation modulersquo described over 41 a corpus for evaluatingrelations between answers was designed before dealing with the implementationof the module itself For that effort we took the references we gathered in theprevious corpus and for each pair of references assign one of the possible fourtypes of relation they share (Equivalence Inclusion Contradiction or Alterna-tive) A few of these relations were not entirely clear and in that case a fifthoption (lsquoIn Doubtrsquo) was created This corpus was eventually not used but itremains here for future work purposes

As it happened with the lsquoanswers corpusrsquo in order to create this corporacontaining the relationships between the answers more efficiently a new appli-cation was made It receives the questions and references file as a single inputand then goes to iterate on the references on pairs asking the user to chooseone of the following options corresponding to the relations mentioned in Section21

ndash lsquoErsquo for a relation of equivalence between pairs

ndash lsquoIrsquo if it is a relation of inclusion (this was later expanded to lsquoI1rsquo if thefirst answer includes the second and lsquoI2rsquo if the second answer includes thefirst)

ndash lsquoArsquo for alternative (complementary) answers

ndash lsquoCrsquo for contradictory answers

42

ndash lsquoDrsquo for causes of doubt between relation types

If the command given by the input fails to match any of these options thescript asks the user to repeat the judgment If the question only has a singlereference it just advances to the next question If the user running the scriptwants to skip a pair of references heshe can type lsquoskiprsquo At the end all therelations are printed to an output file too Each line has 1) the identifier of thequestion 2) the type of relationship between the two answers (INCLUSIONEQUIVALENCE COMPLEMENTARY ANSWERS CONTRADITORY AN-SWERS or IN DOUBT) and 3) the pair of answers evaluated For examplewe have

QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee

In other words for the 5th question (lsquoWho is the president of South Ko-rearsquo) there is a relationship of equivalence between the answersreferences lsquoLeeMyung-bakrsquo and lsquoLeersquo

Figure 6 Execution of the application to create relationsrsquo corpora in the com-mand line

Over the next chapter we will try to improve the results working directlywith several possible distance measures (and combinations of two or more ofthese) we can apply during the clustering phase

43

5 Clustering Answers

As we concluded in Section 44 results were still not improving as much as onecan hope for But we still had new clustering metrics to potentially improveour system as we proposed also doing Therefore we decided to try new ex-periments by bringing new similarity metrics to the clustering stage The mainmotivation was to find a new measure (or a combination of measures) that couldbring better results than Levenshtein did for the baseline

51 Testing Different Similarity Metrics

511 Clusterer Metrics

Our practical work here began with experimenting different similarity metricsintegrating them into the distances package which contained the Overlap andLevenshtein distances For that several lexical metrics that were used for a Nat-ural Language project in 20102011 at Tagus Park were added These metricswere taken from a Java similarity package (library) ndash lsquoukacshefwitsimmetricssimilaritymetricsrsquo

ndash Jaccard Index [11] ndash the Jaccard index (also known as Jaccard similaritycoefficient) is defined by the size of the intersection divided by the size ofthe union between two sets (A and B)

J(AB) =|A capB||A cupB|

(15)

In other words the idea will be to divide the number of characters incommon by the sum of the total number of characters of the two sequences

ndash Dicersquos Coefficient [6] ndash related to Jaccard but the similarity is now com-puted using the double of the number of characters that the two sequencesshare in common With X and Y as two sets of keywords Dicersquos coefficientis given by the following formula

Dice(AB) =2|A capB||A|+ |B|

(16)

Dicersquos Coefficient will basically give the double of weight to the intersectingelements doubling the final score Consider the following two stringslsquoKennedyrsquo and lsquoKeneddyrsquo When taken as a string similarity measure thecoefficient can be calculated using the sets of bigrams for each of the twostrings [15] The set of bigrams for each word is Ke en nn ne ed dyand Ke en ne ed dd dy respectively Each set has six elements andthe intersection of these two sets gives us exactly four elements Ke enne dy The result would be (24)(6+6) = 812 = 23 If we were usingJaccard we would have 412 = 13

44

ndash Jaro [13 12] ndash this distance is mainly used in the record linkage (duplicatedetection) area It is calculated by the following

Jaro(s1 s2) =1

3(m

|s1|+

m

|s2|+mminus tm

) (17)

in which s1 and s2 are the two sequences to be tested |s1| and |s2| aretheir lengths m is the number of matching characters in s1 and s2 and tis the number of transpositions needed to put the matching characters inthe same order

ndash Jaro-Winkler [45] ndash is an extension of the Jaro metric This distance isgiven by

dw = dj + (lp(1minus dj)) (18)

with dj being the Jaro distance between the two sequences l the length ofthe common prefix and p a constant scaling factor (the standard value forthis constant in Winklerrsquos work is p = 01) Jaro-Winkler therefore putsmore weight into sharing a common prefix compared to the original Jarodistance In other words it will give a much better score to strings suchas lsquoArnoldrsquo vs lsquoArniersquo or misspellings such as lsquoMartharsquo vs lsquoMarhtarsquo thanJaro

ndash Needleman-Wunsch [30] ndash dynamic programming algorithm similar toLevenshtein distance but adding a variable cost G (instead of the constant1 for Levenshtein)

D(i j) =

D(iminus 1 j minus 1) +GD(iminus 1 j) +GD(i j minus 1) +G

(19)

This package contained other metrics of particular interest to test suchas the following two measures which were also added and tested

ndash Soundex [36] ndash With this algorithm more focused to match words whichsound alike but differ lexically we can expect to match misspelled answerswith a higher Levenshtein distance than the threshold proposed (such aslsquoMassachussetsrsquo vs lsquoMachassucetsrsquo) or even match answers in differentlanguages that happen to sound alike but spelled differently

ndash Smith-Waterman [40] ndash is a variation of Needleman-Wunsch commonlyused in bioinformatics to align protein or nucleotide sequences It com-pares segments of all possible lengths (local alignments) instead of lookingat the sequence in its entirety and chooses the one that maximises thesimilarity measure leading us to believe that it might work better withanswers with common affixes such as lsquoactingrsquo and lsquoenactrsquo

45

Then the LogSim metric by Cordeiro and his team described over theRelated Work in Section 322 was implemented and added to the packageusing the same weighting penalty of 27 as in the teamrsquos experiments We thentested all the similarity measures using three different thresholds values 12(05) 13(033) and 16(017) of the single-link clustering algorithm in orderto evaluate how much does it affect the performance of each metric

512 Results

Results are described in Table 8 We decided to evaluate these metrics usingthe corpus of 200 questions for a simple reason it is by far the corpus thatprovides us the best results of the system at this point and we should expectproporcional results between these two corpus either way

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – it even holds the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold    0.17 (P/R/A)          0.33 (P/R/A)          0.5 (P/R/A)
Levenshtein           51.1 / 87.0 / 44.5    43.7 / 87.5 / 38.5    37.3 / 84.5 / 31.5
Overlap               37.9 / 87.0 / 33.0    33.0 / 87.0 / 33.0    36.4 / 86.5 / 31.5
Jaccard index          9.8 / 56.0 /  5.5     9.9 / 55.5 /  5.5     9.9 / 55.5 /  5.5
Dice Coefficient       9.8 / 56.0 /  5.5     9.9 / 56.0 /  5.5     9.9 / 55.5 /  5.5
Jaro                  11.0 / 63.5 /  7.0    11.9 / 63.0 /  7.5     9.8 / 56.0 /  5.5
Jaro-Winkler          11.0 / 63.5 /  7.0    11.9 / 63.0 /  7.5     9.8 / 56.0 /  5.5
Needleman-Wunsch      43.7 / 87.0 / 38.0    43.7 / 87.0 / 38.0    16.9 / 59.0 / 10.0
LogSim                43.7 / 87.0 / 38.0    43.7 / 87.0 / 38.0    42.0 / 87.0 / 36.5
Soundex               35.9 / 83.5 / 30.0    35.9 / 83.5 / 30.0    28.7 / 82.0 / 23.5
Smith-Waterman        17.3 / 63.5 / 11.0    13.4 / 59.5 /  8.0    10.4 / 57.5 /  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are obtained with Levenshtein at threshold 0.17.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it additionally gives an extra weight to strings that not only share at least one perfect token match but also have another token sharing its first letter with an unmatched token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as large as the score we give to a 'perfect' match, but it should help us cluster more answers of this kind. For that matter, we divide by 2 what would otherwise be a value identical to that of a match between two tokens, to obtain that extra weight.

The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by

commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))}    (21)

and

firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2}    (22)

with a_1 and a_2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
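The following is a minimal, self-contained sketch of the metric described by equations (20)–(22). Tokenization and normalization are simplified here (whitespace splitting, case-insensitive comparison), so it illustrates the idea rather than reproducing the exact JustAsk implementation.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class JFKDistance {

        public static double distance(String answer1, String answer2) {
            List<String> t1 = new ArrayList<>(Arrays.asList(answer1.toLowerCase().split("\\s+")));
            List<String> t2 = new ArrayList<>(Arrays.asList(answer2.toLowerCase().split("\\s+")));
            int maxLen = Math.max(t1.size(), t2.size());

            // Count and remove perfect token matches (the 'common terms').
            int commonTerms = 0;
            for (String tok : new ArrayList<>(t1)) {
                if (t2.remove(tok)) {
                    t1.remove(tok);
                    commonTerms++;
                }
            }

            // Among the remaining tokens, count pairs that share the same first letter.
            int nonCommonTerms = t1.size() + t2.size();
            int firstLetterMatches = 0;
            for (String tok : new ArrayList<>(t1)) {
                for (String other : new ArrayList<>(t2)) {
                    if (!tok.isEmpty() && !other.isEmpty() && tok.charAt(0) == other.charAt(0)) {
                        t1.remove(tok);
                        t2.remove(other);
                        firstLetterMatches++;
                        break;
                    }
                }
            }

            double commonTermsScore = (double) commonTerms / maxLen;            // equation (21)
            double firstLetterScore = nonCommonTerms == 0
                    ? 0.0
                    : ((double) firstLetterMatches / nonCommonTerms) / 2.0;     // equation (22)
            return 1.0 - commonTermsScore - firstLetterScore;                   // equation (20)
        }

        public static void main(String[] args) {
            // 'J. F. Kennedy' vs 'John Fitzgerald Kennedy' -> roughly 0.42, low enough to
            // cluster the two answers, unlike the ~0.5 or worse given by Levenshtein.
            System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
        }
    }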

A few flaws of this measure were identified later on, namely the scope of this measure within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                    200 questions             1210 questions
Before              P51.1  R87.0  A44.5       P28.2  R79.8  A22.5
After new metric    P51.7  R87.0  A45.0       P29.2  R79.8  A23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that also exist in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but that for the category Location LogSim is the metric which offers the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)
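A typed version of this dispatch could look like the sketch below. The class name, the placeholder metric methods and their bodies are hypothetical, standing in for the real implementations in the distances package.

    import java.util.function.BiFunction;

    public class MetricSelector {

        // Placeholder metrics; the real implementations live in the distances package.
        static double levenshtein(String a, String b) { /* ... */ return 0.0; }
        static double logSim(String a, String b)      { /* ... */ return 0.0; }

        // Choose a distance metric according to the question category, mirroring the
        // pseudo-code above.
        static BiFunction<String, String, Double> metricFor(String category) {
            switch (category) {
                case "Human":    return MetricSelector::levenshtein;
                case "Location": return MetricSelector::logSim;
                default:         return MetricSelector::levenshtein;
            }
        }
    }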

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect the analysis we are looking for too much), Table 10 shows the results for every metric incorporated, by the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, evaluated by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input).

Metric / Category     Human    Location   Numeric
Levenshtein           50.00    53.85      4.84
Overlap               42.42    41.03      4.84
Jaccard index          7.58     0.00      0.00
Dice                   7.58     0.00      0.00
Jaro                   9.09     0.00      0.00
Jaro-Winkler           9.09     0.00      0.00
Needleman-Wunsch      42.42    46.15      4.84
LogSim                42.42    46.15      4.84
Soundex               42.42    46.15      3.23
Smith-Waterman         9.09     2.56      0.00
JFKDistance           54.55    53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best results per category in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across each category, i.e., results in each category are always 'proportional' to how they perform overall. That stops us from picking a best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a single, fixed answer that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, who should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are easily failable, because when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass through: a) ignoring 'one' to 'three' when written out as words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – for some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state of the art review on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1% using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noémie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Languages, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers Ai and Aj, and considering:

• Ai,j – the jth token of candidate answer Ai
• Ai,j,k – the kth char of the jth token of candidate answer Ai
• lemma(X) – the lemma of X
• ishypernym(X, Y) – returns true if X is the hypernym of Y
• length(X) – the number of tokens of X
• contains(X, Y) – returns true if X contains Y
• containsic(X, Y) – returns true if X contains Y, case insensitive
• equals(X, Y) – returns true if X is lexicographically equal to Y
• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive
• matches(X, RegExp) – returns true if X matches the regular expression RegExp
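For illustration, a few of these primitives could be realized in Java as sketched below; these are simplified stand-ins, not the actual JustAsk implementations.

    import java.util.regex.Pattern;

    public class Primitives {

        // Case-insensitive lexicographic equality (equalsic).
        static boolean equalsic(String x, String y) {
            return x.equalsIgnoreCase(y);
        }

        // Case-insensitive containment (containsic).
        static boolean containsic(String x, String y) {
            return x.toLowerCase().contains(y.toLowerCase());
        }

        // Full regular-expression match (matches).
        static boolean matches(String x, String regExp) {
            return Pattern.compile(regExp).matcher(x).matches();
        }

        // Number of tokens of X (length), assuming whitespace tokenization.
        static int length(String x) {
            return x.trim().split("\\s+").length;
        }
    }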

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001
• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001
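A minimal sketch of these two Date inclusion rules, assuming whitespace tokenization:

    public class DateRules {

        // Ai includes Aj when Aj contains Ai and Aj is one or two tokens longer,
        // e.g. 'M01 Y2001' includes 'D01 M01 Y2001'.
        static boolean includes(String ai, String aj) {
            int diff = aj.trim().split("\\s+").length - ai.trim().split("\\s+").length;
            return aj.contains(ai) && (diff == 1 || diff == 2);
        }
    }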



1.2 Questions of category Numeric

Given:

• number = (d+([0-9E]))
• lessthan = ((less than)|(almost)|(up to)|(nearly))
• morethan = ((more than)|(over)|(at least))
• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)
• min = 0.9 (or another value)
• numX – the numeric value in X
• numXmax = numX × max
• numXmin = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200
• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
  – Example: approximately 200 people includes 210
• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  – Example: approximately 200 people includes 200
• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210
• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  – Example: more than 200 includes 300
• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  – Example: less than 300 includes 200
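A minimal sketch of the ±10% window used by these Numeric rules (max = 1.1, min = 0.9), assuming a simplified number-extraction pattern:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NumericRules {

        private static final Pattern NUMBER = Pattern.compile("\\d+(\\.\\d+)?");

        // Two numeric answers are considered equivalent when one falls inside the
        // other's +/-10% window, e.g. '210 people' vs. '200 people'.
        static boolean numericallyEquivalent(String a1, String a2) {
            Double n1 = firstNumber(a1), n2 = firstNumber(a2);
            if (n1 == null || n2 == null) return false;
            return n1 * 1.1 > n2 && n1 * 0.9 < n2;
        }

        private static Double firstNumber(String s) {
            Matcher m = NUMBER.matcher(s);
            return m.find() ? Double.parseDouble(m.group()) : null;
        }
    }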

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to John Kennedy
• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr Kennedy is equivalent to John Kennedy
• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F Kennedy
• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to F Kennedy
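A minimal sketch of the first of these Human equivalence rules, assuming whitespace tokenization:

    public class HumanRules {

        // A three-token name and a two-token name are equivalent when their first and
        // last tokens match case-insensitively, e.g. 'John F Kennedy' and 'John Kennedy'.
        static boolean equivalentThreeVsTwo(String a1, String a2) {
            String[] t1 = a1.trim().split("\\s+");
            String[] t2 = a2.trim().split("\\s+");
            return t1.length == 3 && t2.length == 2
                    && t1[0].equalsIgnoreCase(t2[0])
                    && t1[2].equalsIgnoreCase(t2[1]);
        }
    }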

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr Kennedy is equivalent to Kennedy
• containsic(Ai, Aj)
  – Example:
• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China
• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:
• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: in China is equivalent to China
• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China
• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida
• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California
• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  – Example: California includes East California
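A minimal sketch of the first two of these Location inclusion rules, implemented with a regular expression over the more specific answer rather than with token positions (an assumption of this sketch):

    import java.util.regex.Pattern;

    public class LocationRules {

        // A one-token location such as 'Florida' includes a more specific
        // 'Miami, Florida' or 'Los Angeles, California' style answer ending in it.
        static boolean includesMoreSpecific(String a1, String a2) {
            if (a1.trim().split("\\s+").length != 1) return false;
            Pattern p = Pattern.compile(".+,\\s*" + Pattern.quote(a1.trim()) + "\\s*",
                    Pattern.CASE_INSENSITIVE);
            return p.matcher(a2.trim()).matches();
        }
    }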

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey


[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal

[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009

[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools

55

[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005

56

Appendix

Appendix A ndash Rules for detecting relations between rules

57

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2 and considering

bull Aij ndash the jth token of candidate answer Ai

bull Aijk ndash the kth char of the jth token of candidate answer Ai

bull lemma(X) ndash the lemma of X

bull ishypernym(XY ) ndash returns true if X is the hypernym of Y

bull length(X) ndash the number of tokens of X

bull contains(X Y ) ndash returns true if Ai contains Aj

bull containsic(X Y ) ndash returns true if X contains Y case insensitive

bull equals(X Y ) ndash returns true if X is lexicographically equal to Y

bull equalsic(X Y ) ndash returns true if X is lexicographically equal to Y caseinsensitive

bull matches(XRegExp) ndash returns true if X matches the regular expressionRegExp

11 Questions of category Date

Ai includes Aj if

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 1)ndash Example M01 Y2001 includes D01 M01 Y2001

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 2)ndash Example Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

12 Questions of category Numeric

Given

bull number = (d+([0-9E]))

bull lessthan = ((less than)|(almost)|(up to)|(nearly))

bull morethan = ((more than)|(over)|(at least))

bull approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

bull max = 11 (or another value)

bull min = 09 (or another value)

bull numX ndash the numeric value in X

bull numXmax = numX max

bull numXmin = numX min

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj number) andmatches(Ai lsquolsquoAj( [a-z]+)rsquorsquo)ndash Example 200 people is equivalent to 200

bull matchesic(Ai lsquolsquo()bnumber( )rsquorsquo) andmatchesic(Aj lsquolsquo()bnumber( )rsquorsquo) and(numAimax gt numAj) and (numAimin lt numAj)ndash Example 210 people is equivalent to 200 people

Ai includes Aj if

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo)ndash Example approximately 200 people includes 210

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAi = numAj)ndash Example approximately 200 people includes 200

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAjmax gt numAi) and (numAjmin lt numAi)ndash Example approximately 200 people includes 210

1 Rules for detecting relations between answers 3

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bmorethan number( )rsquorsquo) and(numAi lt numAj)ndash Example more than 200 includes 300

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()blessthan number( )rsquorsquo) and(numAi gt numAj)ndash Example less than 300 includes 200

13 Questions of category Human

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai0Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to John Kennedy

bull (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai1Aj1) andmatches(Ai0lsquolsquo(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)rsquorsquo)ndash Example Mr Kennedy is equivalent to John Kennedy

bull (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai0Aj0) and equalsic(Ai2Aj2)and equalsic(Ai10Aj10)ndash Example John Fitzgerald Kennedy is equivalent to John F Kennedy

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai1Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to F Kennedy

131 Questions of category HumanIndividual

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai1Aj0)ndash Example Mr Kennedy is equivalent to Kennedy

bull containsic(AiAj) ndash Example

bull levenshteinDistance(AiAj) lt 1 ndash Example

14 Questions of category Location

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai0lsquolsquo(republic|city|kingdom|island)rsquorsquo) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example Republic of China is equivalent to China

1 Rules for detecting relations between answers 4

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ailsquolsquosins+Aj0rsquorsquo)ndash Example in China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) andmatches(Ailsquolsquo(republic|city|kingdom|island)s+(in|of)s+Aj0rsquorsquo)ndash Example in the lovely Republic of China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) and matches(Ailsquolsquo(in|of)s+Aj0rsquorsquo)ndash Example

Ai includes Aj if

bull (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai0Aj2) andequalsic(Aj1ldquordquo) andndash Example Florida includes Miami Florida

bull (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai0Aj3) andequalsic(Aj2ldquordquo) andndash Example California includes Los Angeles California

bull (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai0Aj1) andmatchesic(Aj0lsquolsquo(south|west|north|east|central)[a-z]rsquorsquo) andndash Example California includes East California

15 Questions of category Entity

Ai is equivalent to Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and equalsic(lemma(Ai)lemma(Aj))ndash Example

Ai includes Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and ishypernym(AiAj)ndash Example animal includes monkey

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 40: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

Figure 5: Relations' module basic architecture. Each answer now has a field representing the set of relations it shares with other answers. The interface Scorer, made to boost the score of the answers sharing a particular relation, has two components: the Baseline Scorer, which counts the frequency, and the Rules Scorer, which counts relations of equivalence and inclusion.
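
The following is a minimal sketch of that design; the class and method names, the relation representation and the weights are illustrative assumptions, not JustAsk's actual code.

    import java.util.EnumMap;
    import java.util.Map;

    // Illustrative sketch only: names and weights are assumptions.
    enum Relation { EQUIVALENCE, INCLUSION }

    class Candidate {
        final String text;
        int frequency;                                   // how many times it was extracted
        final Map<Relation, Integer> relations = new EnumMap<>(Relation.class);
        Candidate(String text) { this.text = text; }
    }

    interface Scorer {
        double score(Candidate c);
    }

    class BaselineScorer implements Scorer {
        // Baseline: the score is simply the extraction frequency.
        public double score(Candidate c) { return c.frequency; }
    }

    class RulesScorer implements Scorer {
        // Boosts candidates sharing relations of equivalence or inclusion
        // with other candidates (the weights here are made up for the example).
        public double score(Candidate c) {
            int eq = c.relations.getOrDefault(Relation.EQUIVALENCE, 0);
            int in = c.relations.getOrDefault(Relation.INCLUSION, 0);
            return 1.0 * eq + 0.5 * in;
        }
    }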


4.2 Additional Features

4.2.1 Normalizers

To solve the conversion, abbreviation and acronym issues described in points 2, 5 and 7 of Section 3.4, a few normalizers were developed. For the conversion aspect, a specific normalizer class was created dealing with answers of the types Numeric:Distance, Numeric:Size, Numeric:Weight and Numeric:Temperature, which is able to convert kilometres to miles, meters to feet, kilograms to pounds and Celsius degrees to Fahrenheit, respectively.
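
A toy sketch of such a conversion normalizer is shown below; the class name, category labels and parsing are assumptions made for illustration, not the actual JustAsk class.

    // Converts a metric answer such as "100 kilometres" into its imperial
    // counterpart so that two equal answers in different units can be clustered.
    class UnitNormalizer {
        static String normalize(String answer, String category) {
            String[] parts = answer.trim().split("\\s+");
            if (parts.length != 2) return answer;
            double value;
            try { value = Double.parseDouble(parts[0]); }
            catch (NumberFormatException e) { return answer; }
            String unit = parts[1].toLowerCase();
            switch (category) {
                case "NUMERIC:DISTANCE":
                    if (unit.startsWith("kilometre") || unit.startsWith("kilometer"))
                        return (value * 0.621371) + " miles";
                    if (unit.startsWith("meter") || unit.startsWith("metre"))
                        return (value * 3.28084) + " feet";
                    break;
                case "NUMERIC:WEIGHT":
                    if (unit.startsWith("kilogram"))
                        return (value * 2.20462) + " pounds";
                    break;
                case "NUMERIC:TEMPERATURE":
                    if (unit.startsWith("celsius"))
                        return (value * 9.0 / 5.0 + 32.0) + " fahrenheit";
                    break;
            }
            return answer;
        }
    }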

Another normalizer was created to remove several common abbreviations from the candidate answers, hoping that (with the help of new measures) we could improve clustering too. The abbreviations removed were 'Mr.', 'Mrs.', 'Ms.', 'King', 'Queen', 'Dame', 'Sir' and 'Dr.'. Here we find a particular issue worth noting: we could be removing something we should not be removing. Consider the possible answers 'Mr. T' or 'Queen' (the band). Curiously, an extended corpus of 1134 questions provided us with yet another example we would probably be missing: 'Mr. King'. In this case, however, we can easily limit 'King' removals to occurrences of the word right at the beginning of the answer, and we can also add exceptions for the cases above. Still, it remains a reminder that we must not underestimate, under any circumstances, the richness of a natural language.

A normalizer was also made to solve the formatting issue of point 6, treating 'Grozny' and 'GROZNY' as equal answers.

For point 7, a generic acronym normalizer was used, converting acronyms such as 'US' to 'United States' or 'UN' to 'United Nations', for categories such as Location:Country and Human:Group. As with the normalizers above, it did not produce any noticeable changes in the system results, and it only covers very particular cases for the moment, leaving room for further extensions.

4.2.2 Proximity Measure

As for problem 1, we decided to implement the solution of giving extra weight to answers that appear closer to the query keywords found in the passages. For that, we calculate the average distance from the answer to the several keywords and, if that average distance is under a certain threshold (we defined it as 20 characters), we boost that answer's score (by 5 points).
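
A rough sketch of this boost, under the assumption that the answer and the keywords are identified by character offsets in the passage, could look like the following (names are hypothetical):

    import java.util.List;

    class ProximityBooster {
        static final int THRESHOLD = 20;   // average distance, in characters
        static final double BOOST = 5.0;   // score increment

        // Returns the (possibly boosted) score for a candidate answer found at
        // answerOffset, given the offsets of the query keywords in the passage.
        static double boost(double score, int answerOffset, List<Integer> keywordOffsets) {
            if (keywordOffsets.isEmpty()) return score;
            double total = 0;
            for (int off : keywordOffsets) total += Math.abs(off - answerOffset);
            double avg = total / keywordOffsets.size();
            return avg < THRESHOLD ? score + BOOST : score;
        }
    }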


4.3 Results

                    200 questions                   1210 questions
Before features     P 50.6   R 87.0   A 44.0        P 28.0   R 79.5   A 22.2
After features      P 51.1   R 87.0   A 44.5        P 28.2   R 79.8   A 22.5

Table 7: Precision (P), Recall (R) and Accuracy (A) results (in %) before and after the implementation of the features described in 4.1, 4.2 and 4.3, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that were put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relations module' described in 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear, so a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains available for future work.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

- 'E' for a relation of equivalence between pairs;

- 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

- 'A' for alternative (complementary) answers;

- 'C' for contradictory answers;

- 'D' for cases of doubt between relation types.

If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are printed to an output file as well. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADICTORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5 EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?'), there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.

Figure 6: Execution of the application to create relations' corpora in the command line.
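
A simplified sketch of the annotation loop, with hypothetical names and without the file handling, is given below:

    import java.util.List;
    import java.util.Scanner;

    class RelationAnnotator {
        static final List<String> OPTIONS = List.of("E", "I1", "I2", "A", "C", "D");

        // Asks the user to label the relation between two references of a question,
        // repeating the prompt until a valid judgment (or 'skip') is typed.
        static String annotate(int questionId, String ref1, String ref2, Scanner in) {
            while (true) {
                System.out.printf("QUESTION %d: %s vs. %s [E/I1/I2/A/C/D or skip]%n",
                                  questionId, ref1, ref2);
                String cmd = in.nextLine().trim();
                if (cmd.equalsIgnoreCase("skip") || OPTIONS.contains(cmd.toUpperCase()))
                    return cmd.toUpperCase();
                System.out.println("Unrecognized option, please repeat the judgment.");
            }
        }
    }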

Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed doing. Therefore, we decided to run new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which already contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used in a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics'.

- Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined as the size of the intersection divided by the size of the union of two sets A and B:

  J(A,B) = \frac{|A \cap B|}{|A \cup B|}    (15)

  In other words, the idea is to divide the number of characters the two sequences have in common by the total number of distinct characters of the two sequences combined (their union).

- Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

  Dice(A,B) = \frac{2|A \cap B|}{|A| + |B|}    (16)

  Dice's Coefficient basically gives double weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of the two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 = 5/6. If we were using Jaccard, we would have 5/7 (a code sketch after this list of metrics illustrates both computations).


- Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

  Jaro(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)    (17)

  in which s_1 and s_2 are the two sequences to be tested, |s_1| and |s_2| are their lengths, m is the number of matching characters in s_1 and s_2, and t is the number of transpositions needed to put the matching characters in the same order.

- Jaro-Winkler [45] – an extension of the Jaro metric. This distance is given by:

  d_w = d_j + l \cdot p \cdot (1 - d_j)    (18)

  with d_j being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight on sharing a common prefix than the original Jaro distance does. In other words, it will give a much better score than Jaro to strings such as 'Arnold' vs. 'Arnie', or to misspellings such as 'Martha' vs. 'Marhta'.

- Needleman-Wunsch [30] – a dynamic programming algorithm similar to the Levenshtein distance, but adding a variable cost G (instead of the constant 1 used by Levenshtein):

  D(i,j) = \min\begin{cases} D(i-1, j-1) + G \\ D(i-1, j) + G \\ D(i, j-1) + G \end{cases}    (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

- Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers whose Levenshtein distance exceeds the proposed threshold (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

- Smith-Waterman [40] – a variation of Needleman-Wunsch commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments) instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.
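
The sketch below (written for this document, not the SimMetrics implementation) shows how the bigram-based Dice and Jaccard values from the 'Kennedy'/'Keneddy' example above are obtained:

    import java.util.HashSet;
    import java.util.Set;

    class BigramSimilarity {
        static Set<String> bigrams(String s) {
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + 1 < s.length(); i++) grams.add(s.substring(i, i + 2));
            return grams;
        }

        static double dice(String a, String b) {
            Set<String> x = bigrams(a), y = bigrams(b);
            Set<String> common = new HashSet<>(x);
            common.retainAll(y);
            return 2.0 * common.size() / (x.size() + y.size());
        }

        static double jaccard(String a, String b) {
            Set<String> x = bigrams(a), y = bigrams(b);
            Set<String> common = new HashSet<>(x);
            common.retainAll(y);
            return (double) common.size() / (x.size() + y.size() - common.size());
        }

        public static void main(String[] args) {
            // 'Kennedy' vs 'Keneddy': 5 shared bigrams out of 6 + 6,
            // so Dice = 10/12 ≈ 0.83 and Jaccard = 5/7 ≈ 0.71.
            System.out.println(dice("Kennedy", "Keneddy"));
            System.out.println(jaccard("Kennedy", "Keneddy"));
        }
    }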


Then, the LogSim metric by Cordeiro and his team, described in the Related Work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much it affects the performance of each metric.
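
The following sketch shows, in a simplified greedy form, how a threshold and a pluggable distance metric enter the clustering stage; it is an assumption about the setup, not JustAsk's actual clusterer.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.BiFunction;

    class SingleLinkClusterer {
        // An answer joins a cluster if its distance to at least one member
        // (single link) is below the threshold; otherwise it starts a new cluster.
        static List<List<String>> cluster(List<String> answers,
                                          BiFunction<String, String, Double> distance,
                                          double threshold) {
            List<List<String>> clusters = new ArrayList<>();
            for (String answer : answers) {
                List<String> target = null;
                for (List<String> c : clusters) {
                    for (String member : c) {
                        if (distance.apply(answer, member) < threshold) { target = c; break; }
                    }
                    if (target != null) break;
                }
                if (target == null) { target = new ArrayList<>(); clusters.add(target); }
                target.add(answer);
            }
            return clusters;
        }
    }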

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance at a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric             Threshold 0.17             Threshold 0.33             Threshold 0.5
Levenshtein        P 51.1  R 87.0  A 44.5     P 43.7  R 87.5  A 38.5     P 37.3  R 84.5  A 31.5
Overlap            P 37.9  R 87.0  A 33.0     P 33.0  R 87.0  A 33.0     P 36.4  R 86.5  A 31.5
Jaccard index      P  9.8  R 56.0  A  5.5     P  9.9  R 55.5  A  5.5     P  9.9  R 55.5  A  5.5
Dice Coefficient   P  9.8  R 56.0  A  5.5     P  9.9  R 56.0  A  5.5     P  9.9  R 55.5  A  5.5
Jaro               P 11.0  R 63.5  A  7.0     P 11.9  R 63.0  A  7.5     P  9.8  R 56.0  A  5.5
Jaro-Winkler       P 11.0  R 63.5  A  7.0     P 11.9  R 63.0  A  7.5     P  9.8  R 56.0  A  5.5
Needleman-Wunsch   P 43.7  R 87.0  A 38.0     P 43.7  R 87.0  A 38.0     P 16.9  R 59.0  A 10.0
LogSim             P 43.7  R 87.0  A 38.0     P 43.7  R 87.0  A 38.0     P 42.0  R 87.0  A 36.5
Soundex            P 35.9  R 83.5  A 30.0     P 35.9  R 83.5  A 30.0     P 28.7  R 82.0  A 23.5
Smith-Waterman     P 17.3  R 63.5  A 11.0     P 13.4  R 59.5  A  8.0     P 10.4  R 57.5  A  6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results (in %) for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim, also using a tokenizer to break the strings into sets of tokens, taking the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and additionally giving extra weight to strings that not only share at least a perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then we have two unmatched tokens. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

  distance = 1 - commonTermsScore - firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

  commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))}    (21)

and

  firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2}    (22)

with a_1 and a_2 being the two answers converted to strings after normalization, and nonCommonTerms being the number of terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the example above.
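
A sketch consistent with equations (20)–(22) is shown below; tokenization details and tie-breaking are assumptions, not necessarily the exact implementation.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    class JFKDistance {
        static double distance(String a1, String a2) {
            List<String> t1 = new ArrayList<>(Arrays.asList(a1.trim().toLowerCase().split("\\s+")));
            List<String> t2 = new ArrayList<>(Arrays.asList(a2.trim().toLowerCase().split("\\s+")));
            int maxLen = Math.max(t1.size(), t2.size());

            // Perfect token matches (common terms) are removed from both lists.
            int commonTerms = 0;
            for (String tok : new ArrayList<>(t1)) {
                if (t2.remove(tok)) { t1.remove(tok); commonTerms++; }
            }

            // Among the remaining tokens, count pairs whose first letters match.
            int nonCommonTerms = t1.size() + t2.size();
            int firstLetterMatches = 0;
            List<String> remaining2 = new ArrayList<>(t2);
            for (String x : t1) {
                for (int i = 0; i < remaining2.size(); i++) {
                    String y = remaining2.get(i);
                    if (!x.isEmpty() && !y.isEmpty() && x.charAt(0) == y.charAt(0)) {
                        remaining2.remove(i);
                        firstLetterMatches++;
                        break;
                    }
                }
            }

            double commonTermsScore = (double) commonTerms / maxLen;
            double firstLetterMatchesScore = nonCommonTerms == 0
                    ? 0.0 : ((double) firstLetterMatches / nonCommonTerms) / 2.0;
            return 1.0 - commonTermsScore - firstLetterMatchesScore;
        }

        public static void main(String[] args) {
            // 'J F Kennedy' vs 'John Fitzgerald Kennedy': one common term out of three,
            // two first-letter matches among four non-common tokens -> approx. 0.42.
            System.out.println(distance("J F Kennedy", "John Fitzgerald Kennedy"));
        }
    }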

A few flaws in this measure were identified later on, namely the scope of this measure within our topology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows us the results after the implementation of this new distance metric.

                    200 questions                   1210 questions
Before              P 51.1   R 87.0   A 44.5        P 28.2   R 79.8   A 22.5
After new metric    P 51.7   R 87.0   A 45.0        P 29.2   R 79.8   A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results (in %) after the implementation of the new distance metric, using 30 passages from New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)
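
A hypothetical sketch of this per-category dispatch is shown below; the category labels are placeholders and logSim() is only a stand-in for the real metric, used here just to show the idea that was considered.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.BiFunction;

    class MetricSelector {
        // Normalized Levenshtein distance (standard dynamic programming version).
        static double levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++)
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                            d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
            return (double) d[a.length()][b.length()] / Math.max(1, Math.max(a.length(), b.length()));
        }

        // Stand-in for the LogSim metric, present only to illustrate the dispatch.
        static double logSim(String a, String b) { return levenshtein(a, b); }

        static final Map<String, BiFunction<String, String, Double>> BY_CATEGORY = new HashMap<>();
        static {
            BY_CATEGORY.put("HUMAN", MetricSelector::levenshtein);
            BY_CATEGORY.put("LOCATION", MetricSelector::logSim);
        }

        // Falls back to Levenshtein when no category-specific metric is defined.
        static double distance(String category, String a, String b) {
            return BY_CATEGORY.getOrDefault(category, MetricSelector::levenshtein).apply(a, b);
        }
    }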

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect that much the analysis we are looking for), here are the results for every metric incorporated, by the three main categories – Human, Location and Numeric – that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric              Human     Location    Numeric
Levenshtein         50.00     53.85       4.84
Overlap             42.42     41.03       4.84
Jaccard index        7.58      0.00       0.00
Dice                 7.58      0.00       0.00
Jaro                 9.09      0.00       0.00
Jaro-Winkler         9.09      0.00       0.00
Needleman-Wunsch    42.42     46.15       4.84
LogSim              42.42     46.15       4.84
Soundex             42.42     46.15       3.23
Smith-Waterman       9.09      2.56       0.00
JFKDistance         54.55     53.85       4.84

Table 10: Accuracy results (in %) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across the categories, i.e., results in each category are always 'proportional' to how they perform overall. That stops us from picking the best metric for each category, since we find out that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

- Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).

- Numeric answers are easily failed due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

- The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.

- Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we separate the multiple answers as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

- Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

- Incomplete references – the most easily detectable error to be corrected is the reference file that presents all the possible 'correct' answers, potentially missing viable references.

After filling in such gaps, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

- We performed a state of the art review on relations between answers and paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module.

- We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

- We developed new metrics to improve the clustering of answers.

- We developed new answer normalizers to relate two equal answers in different formats.

- We developed new corpora that we believe will be extremely useful in the future.

- We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1% using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we should try, here are a selected few:

- Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

- Use the rules as a distance metric of their own.

- Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine De Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence, pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

Appendix

Appendix A – Rules for detecting relations between answers

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A_1 and A_2, and considering:

- A_i,j – the j-th token of candidate answer A_i
- A_i,j,k – the k-th char of the j-th token of candidate answer A_i
- lemma(X) – the lemma of X
- ishypernym(X, Y) – returns true if X is a hypernym of Y
- length(X) – the number of tokens of X
- contains(X, Y) – returns true if X contains Y
- containsic(X, Y) – returns true if X contains Y, case insensitive
- equals(X, Y) – returns true if X is lexicographically equal to Y
- equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive
- matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

A_i includes A_j if:

- contains(A_j, A_i) and ((length(A_j) - length(A_i)) = 1)
  Example: M01 Y2001 includes D01 M01 Y2001
- contains(A_j, A_i) and ((length(A_j) - length(A_i)) = 2)
  Example: Y2001 includes D01 M01 Y2001

1.2 Questions of category Numeric

Given:

- number = (\d+([0-9,.E]*))
- lessthan = ((less than)|(almost)|(up to)|(nearly))
- morethan = ((more than)|(over)|(at least))
- approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and:

- max = 1.1 (or another value)
- min = 0.9 (or another value)
- numX – the numeric value in X
- numX_max = numX * max
- numX_min = numX * min

A_i is equivalent to A_j if:

- (length(A_i) = 2) and (length(A_j) = 1) and matches(A_j, number) and matches(A_i, "A_j( [a-z]+)")
  Example: 200 people is equivalent to 200
- matchesic(A_i, "(.*)\bnumber(.*)") and matchesic(A_j, "(.*)\bnumber(.*)") and (numA_i_max > numA_j) and (numA_i_min < numA_j)
  Example: 210 people is equivalent to 200 people

A_i includes A_j if:

- (length(A_j) = 1) and matchesic(A_j, number) and matchesic(A_i, "(.*)\bapprox number(.*)")
  Example: approximately 200 people includes 210
- (length(A_j) = 1) and matchesic(A_j, number) and matchesic(A_i, "(.*)\bapprox number(.*)") and (numA_i = numA_j)
  Example: approximately 200 people includes 200
- (length(A_j) = 1) and matchesic(A_j, number) and matchesic(A_i, "(.*)\bapprox number(.*)") and (numA_j_max > numA_i) and (numA_j_min < numA_i)
  Example: approximately 200 people includes 210
- (length(A_j) = 1) and matchesic(A_j, number) and matchesic(A_i, "(.*)\bmorethan number(.*)") and (numA_i < numA_j)
  Example: more than 200 includes 300
- (length(A_j) = 1) and matchesic(A_j, number) and matchesic(A_i, "(.*)\blessthan number(.*)") and (numA_i > numA_j)
  Example: less than 300 includes 200

(A short code sketch illustrating these numeric rules follows.)
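
The sketch below is a hypothetical illustration of how the approximate-equivalence check for Numeric answers could be implemented; the regular expression and the 10% tolerance mirror the number, max and min definitions above, while everything else is an assumption.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class NumericRules {
        static final Pattern NUMBER = Pattern.compile("(\\d+([0-9,.E]*))");

        // Extracts the first numeric value found in the answer, or null if none.
        static Double numericValue(String answer) {
            Matcher m = NUMBER.matcher(answer);
            if (!m.find()) return null;
            try { return Double.parseDouble(m.group(1).replace(",", "")); }
            catch (NumberFormatException e) { return null; }
        }

        // '210 people' is equivalent to '200 people' because 200 lies within
        // [210 * 0.9, 210 * 1.1] (numA_i_max > numA_j and numA_i_min < numA_j).
        static boolean equivalent(String a1, String a2) {
            Double n1 = numericValue(a1), n2 = numericValue(a2);
            if (n1 == null || n2 == null) return false;
            return n1 * 1.1 > n2 && n1 * 0.9 < n2;
        }
    }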

1.3 Questions of category Human

A_i is equivalent to A_j if:

- (length(A_i) = 3) and (length(A_j) = 2) and equalsic(A_i,0, A_j,0) and equalsic(A_i,2, A_j,1)
  Example: John F. Kennedy is equivalent to John Kennedy
- (length(A_i) = 2) and (length(A_j) = 3) and equalsic(A_i,1, A_j,1) and matches(A_i,0, "(Mr)|(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  Example: Mr. Kennedy is equivalent to John Kennedy
- (length(A_i) = 3) and (length(A_j) = 3) and equalsic(A_i,0, A_j,0) and equalsic(A_i,2, A_j,2) and equalsic(A_i,1,0, A_j,1,0)
  Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy
- (length(A_i) = 3) and (length(A_j) = 2) and equalsic(A_i,1, A_j,0) and equalsic(A_i,2, A_j,1)
  Example: John F. Kennedy is equivalent to F. Kennedy

1.3.1 Questions of category Human:Individual

A_i is equivalent to A_j if:

- (length(A_i) = 2) and (length(A_j) = 1) and equalsic(A_i,1, A_j,0)
  Example: Mr. Kennedy is equivalent to Kennedy
- containsic(A_i, A_j)
  Example:
- levenshteinDistance(A_i, A_j) < 1
  Example:

1.4 Questions of category Location

A_i is equivalent to A_j if:

- (length(A_i) = 3) and (length(A_j) = 1) and equalsic(A_i,2, A_j,0) and matches(A_i,0, "(republic|city|kingdom|island)") and matches(A_i,1, "(of|in)")
  Example: Republic of China is equivalent to China
- (length(A_i) = 3) and (length(A_j) = 1) and equalsic(A_i,2, A_j,0) and matches(A_i,1, "(of|in)")
  Example:
- (length(A_i) = 2) and (length(A_j) = 1) and equalsic(A_i,1, A_j,0) and matches(A_i, "in\s+A_j,0")
  Example: in China is equivalent to China
- (length(A_i) >= 2) and (length(A_j) = 1) and matches(A_i, "(republic|city|kingdom|island)\s+(in|of)\s+A_j,0")
  Example: in the lovely Republic of China is equivalent to China
- (length(A_i) >= 2) and (length(A_j) = 1) and matches(A_i, "(in|of)\s+A_j,0")
  Example:

A_i includes A_j if:

- (length(A_i) = 1) and (length(A_j) = 3) and equalsic(A_i,0, A_j,2) and equalsic(A_j,1, ",")
  Example: Florida includes Miami, Florida
- (length(A_i) = 1) and (length(A_j) = 4) and equalsic(A_i,0, A_j,3) and equalsic(A_j,2, ",")
  Example: California includes Los Angeles, California
- (length(A_i) = 1) and (length(A_j) = 2) and equalsic(A_i,0, A_j,1) and matchesic(A_j,0, "(south|west|north|east|central)[a-z]*")
  Example: California includes East California

1.5 Questions of category Entity

A_i is equivalent to A_j if:

- (length(A_i) <= 2) and (length(A_j) <= 2) and equalsic(lemma(A_i), lemma(A_j))
  Example:

A_i includes A_j if:

- (length(A_i) <= 2) and (length(A_j) <= 2) and ishypernym(A_i, A_j)
  Example: animal includes monkey

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 41: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

42 Additional Features

421 Normalizers

To solve the conversion abbreviation and acronym issues described in points2 5 and 7 of Section 34 a few normalizers were developed For the con-version aspect a specific normalizer class was created dealing with answersof the type NumericDistance NumericSize NumericWeight and Nu-merictemperature that is able to convert kilometres to miles meters to feetkilograms to pounds and Celsius degrees to Fahrenheit respectively

Another normalizer was created to remove several common abbreviaturesfrom the candidate answers hoping that (with the help of new measures) wecould improve clustering too The abbreviatures removed were lsquoMrrsquo lsquoMrsrsquorsquo Mslsquo lsquoKingrsquo lsquoQueenrsquo lsquoDamersquo lsquoSirrsquo and lsquoDrrsquo Over here we find a partic-ular issue worthy of noting we could be removing something we should notbe removing Consider the possible answers lsquoMr Trsquo or lsquoQueenrsquo (the band)Curiously an extended corpus of 1134 questions provided us even other exam-ple we would probably be missing lsquoMr Kingrsquo ndash in this case however we caneasily limit King removals only to those occurences of the word lsquoKingrsquo right atthe beggining of the answer We can also add exceptions for the cases aboveStill it remains a curious statement that we must not underestimate under anycircumstances the richness of a natural language

A normalizer was also made to solve the formatting issue of point 6 to treatlsquoGroznyrsquo and lsquoGROZNYrsquo as equal answers

For the point 7 a generic acronym normalizer was used converting acronymssuch as lsquoUSrsquo to lsquoUnited Statesrsquo or lsquoUNrsquo to lsquoUnited Nationsrsquo ndash for categoriessuch as LocationCountry and HumanGroup As with the case of the nor-malizers above it did not produce any noticeable changes in the system resultsbut it also evaluates very particular cases for the moment allowing room forfurther extensions

422 Proximity Measure

As with the problem in 1 we decided to implement the solution of giving extraweight to answers that remain closer to the query keywords that appear throughthe passages For that we calculate the average distance from the answer tothe several keywords and if that average distance is under a certain threshold(we defined it as being 20 characters long) we boost that answerrsquos score (by 5points)

41

43 Results

200 questions 1210 questionsBefore features P506 R870

A440P280 R795A222

After features A511 R870A445

P282 R798A225

Table 7 Precision (P) Recall (R)and Accuracy (A) Results before and after theimplementation of the features described in 41 42 and 43 Using 30 passagesfrom New York Times for the corpus of 1210 questions from TREC

4.4 Results Analysis

These preliminary results remain underwhelming, taking into account the effort and hope that was put into the relations module (Section 4.1) in particular.

The variation in results is minimal – only one to three more correct answers, depending on the test corpus (Multisix or TREC).

4.5 Building an Evaluation Corpus

To evaluate the 'relation module' described in Section 4.1, a corpus for evaluating relations between answers was designed before dealing with the implementation of the module itself. For that effort, we took the references gathered in the previous corpus and, for each pair of references, assigned one of the four possible types of relation they share (Equivalence, Inclusion, Contradiction or Alternative). A few of these relations were not entirely clear and, in that case, a fifth option ('In Doubt') was created. This corpus was eventually not used, but it remains here for future work purposes.

As happened with the 'answers corpus', in order to create this corpus containing the relationships between the answers more efficiently, a new application was made. It receives the questions and references file as a single input and then iterates over the references in pairs, asking the user to choose one of the following options, corresponding to the relations mentioned in Section 2.1:

– 'E' for a relation of equivalence between pairs;

– 'I' if it is a relation of inclusion (this was later expanded to 'I1' if the first answer includes the second, and 'I2' if the second answer includes the first);

– 'A' for alternative (complementary) answers;

– 'C' for contradictory answers;

– 'D' for cases of doubt between relation types.

If the command given by the input fails to match any of these options, the script asks the user to repeat the judgment. If the question only has a single reference, it just advances to the next question. If the user running the script wants to skip a pair of references, he/she can type 'skip'. At the end, all the relations are also printed to an output file. Each line has: 1) the identifier of the question; 2) the type of relationship between the two answers (INCLUSION, EQUIVALENCE, COMPLEMENTARY ANSWERS, CONTRADITORY ANSWERS or IN DOUBT); and 3) the pair of answers evaluated. For example, we have:

QUESTION 5: EQUIVALENCE - Lee Myung-bak AND Lee

In other words, for the 5th question ('Who is the president of South Korea?') there is a relationship of equivalence between the answers/references 'Lee Myung-bak' and 'Lee'.
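A minimal sketch of the annotation loop is shown below; in the real application the pairs come from the questions and references file, while here they are hard-coded for illustration.

// Illustrative sketch of the annotation loop: for each pair of references the
// user types E, I1, I2, A, C, D or 'skip'; anything else repeats the prompt.
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

public class RelationAnnotator {

    private static final List<String> OPTIONS = Arrays.asList("E", "I1", "I2", "A", "C", "D", "SKIP");

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        // in the real tool these pairs come from the questions and references file
        String[][] pairs = {{"Lee Myung-bak", "Lee"}, {"two", "2"}};
        for (int i = 0; i < pairs.length; i++) {
            String judgment;
            do {
                System.out.printf("Relation between '%s' and '%s'? [E/I1/I2/A/C/D/skip] ",
                        pairs[i][0], pairs[i][1]);
                judgment = in.nextLine().trim().toUpperCase();
            } while (!OPTIONS.contains(judgment));
            if (!judgment.equals("SKIP")) {
                // same general format as the output file described in the text
                System.out.println("QUESTION " + (i + 1) + ": " + judgment + " - "
                        + pairs[i][0] + " AND " + pairs[i][1]);
            }
        }
        in.close();
    }
}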

Figure 6: Execution of the application to create relations' corpora in the command line.

Over the next chapter, we will try to improve the results by working directly with several possible distance measures (and combinations of two or more of these) that we can apply during the clustering phase.


5 Clustering Answers

As we concluded in Section 4.4, results were still not improving as much as one could hope for. But we still had new clustering metrics that could potentially improve our system, as we had also proposed doing. Therefore, we decided to try new experiments by bringing new similarity metrics to the clustering stage. The main motivation was to find a new measure (or a combination of measures) that could bring better results than Levenshtein did for the baseline.

5.1 Testing Different Similarity Metrics

5.1.1 Clusterer Metrics

Our practical work here began with experimenting with different similarity metrics, integrating them into the distances package, which contained the Overlap and Levenshtein distances. For that, several lexical metrics that were used for a Natural Language project in 2010/2011 at Tagus Park were added. These metrics were taken from a Java similarity package (library) – 'uk.ac.shef.wit.simmetrics.similaritymetrics'.
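As an illustration of how these metrics plug into the distances package: the SimMetrics classes share a common base type and return a similarity in [0, 1], so a distance can be obtained as one minus that value. The class and method names below follow the SimMetrics API as we recall it and should be treated as an assumption rather than a verified listing.

// Illustrative sketch: wrapping a SimMetrics metric as a distance in [0, 1].
import uk.ac.shef.wit.simmetrics.similaritymetrics.AbstractStringMetric;
import uk.ac.shef.wit.simmetrics.similaritymetrics.JaccardSimilarity;
import uk.ac.shef.wit.simmetrics.similaritymetrics.JaroWinkler;

public class MetricAdapter {

    private final AbstractStringMetric metric;

    public MetricAdapter(AbstractStringMetric metric) {
        this.metric = metric;
    }

    public double distance(String a1, String a2) {
        // SimMetrics returns a similarity; the clusterer works with distances
        return 1.0 - metric.getSimilarity(a1, a2);
    }

    public static void main(String[] args) {
        System.out.println(new MetricAdapter(new JaroWinkler()).distance("Arnold", "Arnie"));
        System.out.println(new MetricAdapter(new JaccardSimilarity()).distance("Kennedy", "Keneddy"));
    }
}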

– Jaccard Index [11] – the Jaccard index (also known as the Jaccard similarity coefficient) is defined by the size of the intersection divided by the size of the union of two sets (A and B):

J(A, B) = \frac{|A \cap B|}{|A \cup B|}    (15)

In other words, the idea is to divide the number of characters in common by the sum of the total number of characters of the two sequences.

– Dice's Coefficient [6] – related to Jaccard, but the similarity is now computed using double the number of characters that the two sequences share in common. With A and B as two sets of keywords, Dice's coefficient is given by the following formula:

Dice(A, B) = \frac{2|A \cap B|}{|A| + |B|}    (16)

Dice's Coefficient will basically give double the weight to the intersecting elements, doubling the final score. Consider the following two strings: 'Kennedy' and 'Keneddy'. When taken as a string similarity measure, the coefficient can be calculated using the sets of bigrams of the two strings [15]. The sets of bigrams are {Ke, en, nn, ne, ed, dy} and {Ke, en, ne, ed, dd, dy}, respectively. Each set has six elements, and the intersection of these two sets gives us five elements: {Ke, en, ne, ed, dy}. The result would be (2 × 5)/(6 + 6) = 10/12 ≈ 0.83. If we were using Jaccard, we would have 5/7 ≈ 0.71 (five shared bigrams out of seven distinct ones).
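The bigram computation above can be checked with a few lines of code (illustrative sketch; set semantics, so repeated bigrams count only once):

// Illustrative sketch: Dice and Jaccard over the sets of character bigrams.
import java.util.HashSet;
import java.util.Set;

public class BigramSimilarity {

    static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> a = bigrams("Kennedy");   // {Ke, en, nn, ne, ed, dy}
        Set<String> b = bigrams("Keneddy");   // {Ke, en, ne, ed, dd, dy}
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        double dice = 2.0 * intersection.size() / (a.size() + b.size());
        double jaccard = (double) intersection.size() / union.size();
        System.out.printf("Dice = %.2f, Jaccard = %.2f%n", dice, jaccard);
    }
}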


– Jaro [13, 12] – this distance is mainly used in the record linkage (duplicate detection) area. It is calculated as follows:

Jaro(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)    (17)

in which s_1 and s_2 are the two sequences to be tested, |s_1| and |s_2| are their lengths, m is the number of matching characters in s_1 and s_2, and t is the number of transpositions needed to put the matching characters in the same order.

– Jaro-Winkler [45] – is an extension of the Jaro metric. This distance is given by:

d_w = d_j + l \cdot p \cdot (1 - d_j)    (18)

with d_j being the Jaro distance between the two sequences, l the length of the common prefix and p a constant scaling factor (the standard value for this constant in Winkler's work is p = 0.1). Jaro-Winkler therefore puts more weight into sharing a common prefix, compared to the original Jaro distance. In other words, it will give a much better score to strings such as 'Arnold' vs. 'Arnie', or misspellings such as 'Martha' vs. 'Marhta', than Jaro.

– Needleman-Wunsch [30] – dynamic programming algorithm, similar to the Levenshtein distance but adding a variable cost G (instead of the constant 1 for Levenshtein):

D(i, j) = \min\begin{cases} D(i-1, j-1) + G \\ D(i-1, j) + G \\ D(i, j-1) + G \end{cases}    (19)

This package contained other metrics of particular interest to test, such as the following two measures, which were also added and tested:

– Soundex [36] – with this algorithm, more focused on matching words which sound alike but differ lexically, we can expect to match misspelled answers with a higher Levenshtein distance than the threshold proposed (such as 'Massachussets' vs. 'Machassucets'), or even match answers in different languages that happen to sound alike but are spelled differently.

– Smith-Waterman [40] – is a variation of Needleman-Wunsch, commonly used in bioinformatics to align protein or nucleotide sequences. It compares segments of all possible lengths (local alignments), instead of looking at the sequence in its entirety, and chooses the one that maximises the similarity measure, leading us to believe that it might work better with answers sharing common affixes, such as 'acting' and 'enact'.


Then, the LogSim metric by Cordeiro and his team, described in the Related Work (Section 3.2.2), was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values – 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17) – for the single-link clustering algorithm, in order to evaluate how much the threshold affects the performance of each metric.
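For reference, the single-link criterion being tuned here boils down to the sketch below (illustrative; the real clusterer also carries answer scores and normalized forms): an answer joins a cluster as soon as it is within the threshold of any member, and clusters connected through that answer are merged.

// Illustrative sketch of single-link clustering over candidate answers.
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class SingleLinkClusterer {

    public static List<List<String>> cluster(List<String> answers,
                                             BiFunction<String, String, Double> distance,
                                             double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String answer : answers) {
            List<String> merged = null;
            for (List<String> cluster : new ArrayList<>(clusters)) {
                boolean linked = cluster.stream()
                        .anyMatch(member -> distance.apply(answer, member) < threshold);
                if (linked) {
                    if (merged == null) {
                        cluster.add(answer);   // join the first linked cluster
                        merged = cluster;
                    } else {
                        merged.addAll(cluster); // single link: merge clusters connected through this answer
                        clusters.remove(cluster);
                    }
                }
            }
            if (merged == null) {
                List<String> fresh = new ArrayList<>();
                fresh.add(answer);
                clusters.add(fresh);
            }
        }
        return clusters;
    }
}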

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results of the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second place overall – even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric, given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric \ Threshold    0.17                            0.33                            0.5

Levenshtein           P: 51.1  R: 87.0  A: 44.5       P: 43.7  R: 87.5  A: 38.5       P: 37.3  R: 84.5  A: 31.5
Overlap               P: 37.9  R: 87.0  A: 33.0       P: 33.0  R: 87.0  A: 33.0       P: 36.4  R: 86.5  A: 31.5
Jaccard index         P: 9.8   R: 56.0  A: 5.5        P: 9.9   R: 55.5  A: 5.5        P: 9.9   R: 55.5  A: 5.5
Dice Coefficient      P: 9.8   R: 56.0  A: 5.5        P: 9.9   R: 56.0  A: 5.5        P: 9.9   R: 55.5  A: 5.5
Jaro                  P: 11.0  R: 63.5  A: 7.0        P: 11.9  R: 63.0  A: 7.5        P: 9.8   R: 56.0  A: 5.5
Jaro-Winkler          P: 11.0  R: 63.5  A: 7.0        P: 11.9  R: 63.0  A: 7.5        P: 9.8   R: 56.0  A: 5.5
Needleman-Wunch       P: 43.7  R: 87.0  A: 38.0       P: 43.7  R: 87.0  A: 38.0       P: 16.9  R: 59.0  A: 10.0
LogSim                P: 43.7  R: 87.0  A: 38.0       P: 43.7  R: 87.0  A: 38.0       P: 42.0  R: 87.0  A: 36.5
Soundex               P: 35.9  R: 83.5  A: 30.0       P: 35.9  R: 83.5  A: 30.0       P: 28.7  R: 82.0  A: 23.5
Smith-Waterman        P: 17.3  R: 63.5  A: 11.0       P: 13.4  R: 59.5  A: 8.0        P: 10.4  R: 57.5  A: 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created, in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens and takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, but it additionally adds an extra weight for strings that not only share at least a perfect match, but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two tokens unmatched on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5. Still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore    (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

commonTermsScore = \frac{commonTerms}{\max(length(a_1), length(a_2))}    (21)

and

firstLetterMatchesScore = \frac{firstLetterMatches / nonCommonTerms}{2}    (22)

with a_1 and a_2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J', 'John', 'F' and 'Fitzgerald' in the above example.
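Putting equations (20)–(22) together, a possible implementation looks like the sketch below. The way firstLetterMatches and nonCommonTerms are counted follows one plausible reading of equation (22); the real implementation may count them differently.

// Illustrative sketch of JFKDistance: token overlap plus a half-weight bonus for
// non-common tokens whose first letters match across the two answers.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JFKDistance {

    public static double distance(String answer1, String answer2) {
        List<String> tokens1 = new ArrayList<>(Arrays.asList(answer1.trim().split("\\s+")));
        List<String> tokens2 = new ArrayList<>(Arrays.asList(answer2.trim().split("\\s+")));
        int maxLength = Math.max(tokens1.size(), tokens2.size());

        // common terms: tokens shared by both answers (case-insensitive, consumed once)
        int commonTerms = 0;
        List<String> rest1 = new ArrayList<>();
        for (String token : tokens1) {
            if (removeIgnoreCase(tokens2, token)) {
                commonTerms++;
            } else {
                rest1.add(token);
            }
        }
        List<String> rest2 = tokens2; // leftovers of the second answer

        // first-letter matches among the leftover (non-common) tokens
        int nonCommonTerms = rest1.size() + rest2.size();
        int firstLetterMatches = 0;
        for (String token : rest1) {
            for (int i = 0; i < rest2.size(); i++) {
                if (!token.isEmpty() && !rest2.get(i).isEmpty()
                        && Character.toLowerCase(token.charAt(0)) == Character.toLowerCase(rest2.get(i).charAt(0))) {
                    firstLetterMatches += 2; // assumption: both tokens of the pair count as matched
                    rest2.remove(i);
                    break;
                }
            }
        }

        double commonTermsScore = (double) commonTerms / maxLength;
        double firstLetterMatchesScore = nonCommonTerms == 0
                ? 0.0
                : (firstLetterMatches / (double) nonCommonTerms) / 2.0;
        return 1.0 - commonTermsScore - firstLetterMatchesScore;
    }

    private static boolean removeIgnoreCase(List<String> tokens, String token) {
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equalsIgnoreCase(token)) {
                tokens.remove(i);
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
    }
}

Under this reading, 'J. F. Kennedy' vs. 'John Fitzgerald Kennedy' yields a distance of roughly 0.17, low enough to fall under the clustering thresholds used above.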

A few flaws of this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows us the results after the implementation of this new distance metric.

                         200 questions                   1210 questions

Before                   P: 51.1   R: 87.0   A: 44.5     P: 28.2   R: 79.8   A: 22.5
After new metric         P: 51.7   R: 87.0   A: 45.0     P: 29.2   R: 79.8   A: 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric. Using 30 passages from New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or a group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

()
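Such a decision could be captured by a small dispatch table, sketched below with illustrative names: metrics would be registered per category, with a fallback for the remaining ones (e.g. register("HUMAN", levenshtein) and register("LOCATION", logSim), keeping JFKDistance as the fallback).

// Illustrative sketch: dispatch to a different distance metric per question category.
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

public class CategoryAwareDistance {

    private final Map<String, BiFunction<String, String, Double>> byCategory = new HashMap<>();
    private final BiFunction<String, String, Double> fallback;

    public CategoryAwareDistance(BiFunction<String, String, Double> fallback) {
        this.fallback = fallback;
    }

    public void register(String category, BiFunction<String, String, Double> metric) {
        byCategory.put(category, metric);
    }

    public double distance(String category, String a1, String a2) {
        return byCategory.getOrDefault(category, fallback).apply(a1, a2);
    }
}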

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not affect the analysis we are looking for that much), the results for every metric incorporated, over the three main categories that can be easily evaluated – Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input) – are shown in Table 10.

Metric \ Category     Human     Location   Numeric

Levenshtein           50.00     53.85      4.84
Overlap               42.42     41.03      4.84
Jaccard index         7.58      0.00       0.00
Dice                  7.58      0.00       0.00
Jaro                  9.09      0.00       0.00
Jaro-Winkler          9.09      0.00       0.00
Needleman-Wunch       42.42     46.15      4.84
LogSim                42.42     46.15      4.84
Soundex               42.42     46.15      3.23
Smith-Waterman        9.09      2.56       0.00
JFKDistance           54.55     53.85      4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy per category in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – some questions do not have a fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

– Numeric answers are easily failable, due to the fact that when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– The filtering stage does not seem to work for certain questions – in some questions, we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to answer 'Dillinger', but chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer 96 of the 200 questions correctly, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We surveyed the state of the art on relations between answers and paraphrase identification, in order to detect new possible ways to improve JustAsk's results in the Answer Extraction module;

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion;

– We developed new metrics to improve the clustering of answers;

– We developed new answer normalizers to relate two equal answers in different formats;

– We developed new corpora that we believe will be extremely useful in the future;

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers;

– Use the rules as a distance metric of their own;

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine De Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN, pages 1870–4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Languages, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa – Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the jth token of candidate answer Ai;

• Ai,j,k – the kth char of the jth token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is the hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001


1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
  – Example: less than 300 includes 200

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr. Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F. Kennedy is equivalent to F. Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr. Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey
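To make the formalism concrete, the first Location equivalence rule above translates into code roughly as follows (illustrative sketch; tokenization and case handling follow the notation of this appendix):

// Illustrative sketch of the Location equivalence rule:
// 'Republic of China' is equivalent to 'China'.
public class LocationRules {

    public static boolean equivalent(String[] ai, String[] aj) {
        return ai.length == 3
                && aj.length == 1
                && ai[2].equalsIgnoreCase(aj[0])
                && ai[0].toLowerCase().matches("(republic|city|kingdom|island)")
                && ai[1].toLowerCase().matches("(of|in)");
    }

    public static void main(String[] args) {
        System.out.println(equivalent(
                new String[]{"Republic", "of", "China"},
                new String[]{"China"})); // true
    }
}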

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 42: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

43 Results

200 questions 1210 questionsBefore features P506 R870

A440P280 R795A222

After features A511 R870A445

P282 R798A225

Table 7 Precision (P) Recall (R)and Accuracy (A) Results before and after theimplementation of the features described in 41 42 and 43 Using 30 passagesfrom New York Times for the corpus of 1210 questions from TREC

44 Results Analysis

These preliminary results remain underwhelming taking into account theeffort and hope that was put into the relations module (Section 41) in particular

The variation in results is minimal ndash only one to three correct answersdepending on the corpus test (Multisix or TREC)

45 Building an Evaluation Corpus

To evaluate the lsquorelation modulersquo described over 41 a corpus for evaluatingrelations between answers was designed before dealing with the implementationof the module itself For that effort we took the references we gathered in theprevious corpus and for each pair of references assign one of the possible fourtypes of relation they share (Equivalence Inclusion Contradiction or Alterna-tive) A few of these relations were not entirely clear and in that case a fifthoption (lsquoIn Doubtrsquo) was created This corpus was eventually not used but itremains here for future work purposes

As it happened with the lsquoanswers corpusrsquo in order to create this corporacontaining the relationships between the answers more efficiently a new appli-cation was made It receives the questions and references file as a single inputand then goes to iterate on the references on pairs asking the user to chooseone of the following options corresponding to the relations mentioned in Section21

ndash lsquoErsquo for a relation of equivalence between pairs

ndash lsquoIrsquo if it is a relation of inclusion (this was later expanded to lsquoI1rsquo if thefirst answer includes the second and lsquoI2rsquo if the second answer includes thefirst)

ndash lsquoArsquo for alternative (complementary) answers

ndash lsquoCrsquo for contradictory answers

42

ndash lsquoDrsquo for causes of doubt between relation types

If the command given by the input fails to match any of these options thescript asks the user to repeat the judgment If the question only has a singlereference it just advances to the next question If the user running the scriptwants to skip a pair of references heshe can type lsquoskiprsquo At the end all therelations are printed to an output file too Each line has 1) the identifier of thequestion 2) the type of relationship between the two answers (INCLUSIONEQUIVALENCE COMPLEMENTARY ANSWERS CONTRADITORY AN-SWERS or IN DOUBT) and 3) the pair of answers evaluated For examplewe have

QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee

In other words for the 5th question (lsquoWho is the president of South Ko-rearsquo) there is a relationship of equivalence between the answersreferences lsquoLeeMyung-bakrsquo and lsquoLeersquo

Figure 6 Execution of the application to create relationsrsquo corpora in the com-mand line

Over the next chapter we will try to improve the results working directlywith several possible distance measures (and combinations of two or more ofthese) we can apply during the clustering phase

43

5 Clustering Answers

As we concluded in Section 44 results were still not improving as much as onecan hope for But we still had new clustering metrics to potentially improveour system as we proposed also doing Therefore we decided to try new ex-periments by bringing new similarity metrics to the clustering stage The mainmotivation was to find a new measure (or a combination of measures) that couldbring better results than Levenshtein did for the baseline

51 Testing Different Similarity Metrics

511 Clusterer Metrics

Our practical work here began with experimenting different similarity metricsintegrating them into the distances package which contained the Overlap andLevenshtein distances For that several lexical metrics that were used for a Nat-ural Language project in 20102011 at Tagus Park were added These metricswere taken from a Java similarity package (library) ndash lsquoukacshefwitsimmetricssimilaritymetricsrsquo

ndash Jaccard Index [11] ndash the Jaccard index (also known as Jaccard similaritycoefficient) is defined by the size of the intersection divided by the size ofthe union between two sets (A and B)

J(AB) =|A capB||A cupB|

(15)

In other words the idea will be to divide the number of characters incommon by the sum of the total number of characters of the two sequences

ndash Dicersquos Coefficient [6] ndash related to Jaccard but the similarity is now com-puted using the double of the number of characters that the two sequencesshare in common With X and Y as two sets of keywords Dicersquos coefficientis given by the following formula

Dice(AB) =2|A capB||A|+ |B|

(16)

Dicersquos Coefficient will basically give the double of weight to the intersectingelements doubling the final score Consider the following two stringslsquoKennedyrsquo and lsquoKeneddyrsquo When taken as a string similarity measure thecoefficient can be calculated using the sets of bigrams for each of the twostrings [15] The set of bigrams for each word is Ke en nn ne ed dyand Ke en ne ed dd dy respectively Each set has six elements andthe intersection of these two sets gives us exactly four elements Ke enne dy The result would be (24)(6+6) = 812 = 23 If we were usingJaccard we would have 412 = 13

44

ndash Jaro [13 12] ndash this distance is mainly used in the record linkage (duplicatedetection) area It is calculated by the following

Jaro(s1 s2) =1

3(m

|s1|+

m

|s2|+mminus tm

) (17)

in which s1 and s2 are the two sequences to be tested |s1| and |s2| aretheir lengths m is the number of matching characters in s1 and s2 and tis the number of transpositions needed to put the matching characters inthe same order

ndash Jaro-Winkler [45] ndash is an extension of the Jaro metric This distance isgiven by

dw = dj + (lp(1minus dj)) (18)

with dj being the Jaro distance between the two sequences l the length ofthe common prefix and p a constant scaling factor (the standard value forthis constant in Winklerrsquos work is p = 01) Jaro-Winkler therefore putsmore weight into sharing a common prefix compared to the original Jarodistance In other words it will give a much better score to strings suchas lsquoArnoldrsquo vs lsquoArniersquo or misspellings such as lsquoMartharsquo vs lsquoMarhtarsquo thanJaro

ndash Needleman-Wunsch [30] ndash dynamic programming algorithm similar toLevenshtein distance but adding a variable cost G (instead of the constant1 for Levenshtein)

D(i j) =

D(iminus 1 j minus 1) +GD(iminus 1 j) +GD(i j minus 1) +G

(19)

This package contained other metrics of particular interest to test suchas the following two measures which were also added and tested

ndash Soundex [36] ndash With this algorithm more focused to match words whichsound alike but differ lexically we can expect to match misspelled answerswith a higher Levenshtein distance than the threshold proposed (such aslsquoMassachussetsrsquo vs lsquoMachassucetsrsquo) or even match answers in differentlanguages that happen to sound alike but spelled differently

ndash Smith-Waterman [40] ndash is a variation of Needleman-Wunsch commonlyused in bioinformatics to align protein or nucleotide sequences It com-pares segments of all possible lengths (local alignments) instead of lookingat the sequence in its entirety and chooses the one that maximises thesimilarity measure leading us to believe that it might work better withanswers with common affixes such as lsquoactingrsquo and lsquoenactrsquo

45

Then the LogSim metric by Cordeiro and his team described over theRelated Work in Section 322 was implemented and added to the packageusing the same weighting penalty of 27 as in the teamrsquos experiments We thentested all the similarity measures using three different thresholds values 12(05) 13(033) and 16(017) of the single-link clustering algorithm in orderto evaluate how much does it affect the performance of each metric

512 Results

Results are described in Table 8 We decided to evaluate these metrics usingthe corpus of 200 questions for a simple reason it is by far the corpus thatprovides us the best results of the system at this point and we should expectproporcional results between these two corpus either way

513 Results Analysis

As we can see the metric that achieves the best results is still Levenshteindistance with the LogSim distance at a close second place overall ndash even holdingthe distintion of being the best measure for the highest threshold evaluated (05)Defining the lsquobestrsquo threshold was not as clear as picking the best metric giventhese results The threshold of 017 provides the best results for more than halfof the metrics here but there are cases when it is only the second best even ifthe difference itself is virtually inexistent in most cases

46

MetricThreshold 017 033 05Levenshtein P511

R870A445

P437R875A385

P373R845A315

Overlap P379R870A330

P330R870A330

P364R865A315

Jaccard index P98R560A55

P99R555A55

P99R555A55

Dice Coefficient P98R560A55

P99R560A55

P99R555A55

Jaro P110R635A70

P119R630A75

P98R560A55

Jaro-Winkler P110R635A70

P119R630A75

P98R560A55

Needleman-Wunch P437R870A380

P437R870A380

P169R590A100

LogSim P437R870A380

P437R870A380

A420R870A365

Soundex P359R835A300

P359R835A300

P287R820A235

Smith-Waterman P173R635A110

P134R595A80

P104R575A60

Table 8 Precision (P) Recall (R) and Accuracy (A) results for every similaritymetric added for a given threshold value using the test corpus of 200 questionsand 32 passages from Google Best precision and accuracy of the system bolded

52 New Distance Metric - rsquoJFKDistancersquo

521 Description

Considering the two candidate answers lsquoJF Kennedyrsquo and lsquoJohn FitzgeraldKennedyrsquo Levenshtein will give us a very high distance and therefore will notgroup these two answers into a single cluster The goal here is to finally beable to relate answers such as these A new measure was therefore created in

47

order to improve the overall accuracy of JustAsk This measure is in a waya progression from metrics such as Overlap Distance and LogSim also using atokenizer to break the strings into a set of tokens and then taking the basicconcept of common terms (we can also call it links words or tokens) betweenthe two strings and also adding an extra weight to strings that not only shareat least a perfect match but have also another token that shares the first letterwith the other token from the other answer To exemplify consider the exampleabove We have a perfect match in the common term Kennedy and then wehave two tokens unmatched Even the apparently intuitive Overlap distancewill give us a value of 23 making it impossible to cluster these two answersLevenshtein will be closer to 05 Still underwhelming Since we already haveone word in common with the new metric we can assume with some securitythat there is a relation between lsquoJohnrsquo and lsquoJrsquo and lsquoFitzgeraldrsquo and lsquoFrsquo Thisweight will obviously not be as big as the score we give to a lsquoperfectrsquo match butit should help us clustering more of these types of answers For that matter wedivide by 2 what would be a value identical to a match between two tokens forthat extra weight

The new distance is computed the following way

distance = 1minus commonTermsScoreminus firstLetterMatchesScore (20)

in which the scores for the common terms and matches of the first letter ofeach token are given by

commonTermsScore =commonTerms

max(length(a1) length(a2)) (21)

and

firstLetterMatchesScore =firstLetterMatchesnonCommonTerms

2(22)

with a1 and a2 being the two answers converted to strings after normaliza-tion and nonCommonTerms being the terms that remain unmatched such aslsquoJrsquo lsquoJohnrsquo lsquoFrsquo andlsquoFitzgeraldrsquo in the above example

A few flaws about this measure were thought later on namely the scopeof this measure whithin our topology of answers if for example we applythis same measure to answers of type Numeric results could become quiteunpredictable and wrong It became quite clear that a separation of measuresgiven a type of answer (Numeric Entity Location etc) was needed if wewanted to use this measure to its full potential as we will see

48

522 Results

Table 9 shows us the results after the implementation of this new distancemetric

200 questions 1210 questionsBefore P511 R870

A445P282 R798A225

After new metric P517 R870A450

P292 R798A233

Table 9 Precision (P) Recall (R) and Accuracy (A) Results after the imple-mentation of the new distance metric Using 30 passages from New York Timesfor the corpus of 1210 questions from TREC

523 Results Analysis

As we can see for these results even with the redundancy of sharing featuresthat we also have in the rules and in certain normalizers this metric can actu-ally improve our results even if the difference is minimal and not statisticallysignificant

53 Possible new features ndash Combination of distance met-rics to enhance overall results

We also thought in analyzing these results by the question category using anevaluation framework developed earlier The motivation behind this evaluationwas to build a decision tree of sorts that would basically choose a metric (or aconjunction of metrics) for a certain category (or a group of categories) If forinstance we detect that the Levenshtein distance gives us the best answers forthe category Human but for the category Location LogSim was the metricwhich offered us the best results we would have something like

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

()

For that effort we used the best thresholds for each metric using the corpusof 200 questions and Google search engine retrieving 32 passages Our questionclassifier achieves an accuracy of 90 [39] and there have been a few answersnot labeled as correct by the framework that are indeed correct With that inmind and considering that these are issues that are common to every singledistance metric used (so they should not affect that much the analysis we are

49

looking for) here are the results for every metric incorporated by the threemain categories Human Location and Numeric that can be easily evaluated(categories such as Description by retrieving the Wikipedia abstract are justmarked as wrongnil given the reference input) shown in Table 10

MetricCategory Human Location NumericLevenshtein 5000 5385 484Overlap 4242 4103 484Jaccard index 758 000 000Dice 758 000 000Jaro 909 000 000Jaro-Winkler 909 000 000Needleman-Wunch 4242 4615 484LogSim 4242 4615 484Soundex 4242 4615 323Smith-Waterman 909 256 000JFKDistance 5455 5385 484

Table 10 Accuracy results for every similarity metric added for each categoryusing the test corpus of 200 questions and 32 passages from Google Bestprecision and accuracy of the system bolded except for when all the metricsgive the same result

We can easily notice that all of these metrics behave in similar ways througheach category ie results are always lsquoproporcionalrsquo in each category to howthey perform overall That stops us from picking best metrics for each categorysince we find out that the new distance implemented equals or surpasses (aswith the case of Human category) all the others

54 Posterior analysis

In another effort to improve the results above a deeper analysis was madethrough the systemrsquos output regarding the corpus test of 200 questions in orderto detect certain lsquoerror patternsrsquo that could be resolved in the future Below arethe main points of improvement

ndash Time-changing answers ndash Some questions do not have a straight answerone that never changes with time A clear example is a question suchas lsquoWho is the prime-minister of Japanrsquo We have observed that formerprime-minister Yukio Hatoyama actually earns a higher score than cur-rent prime-minister Naoto Kan which should be the correct answer Thesolution Give an extra weight to passages that are at least from the sameyear of the collected corpus from Google (in this case 2010)

ndash Numeric answers are easily failable due to the fact that when we expecta number it can pick up words such as lsquoonersquo actually retrieving that as

50

a top answer in a couple of cases Moreover when it expects an answerof type NumericDate it could still easely choose the dateyear whenthe article was posted (or even the number of views of that article) Thesolution might pass for a) ignoring lsquoonersquo to lsquothreersquo when written in wordsb) exclude the publishing date from the snippet somehow and c) improvethe regular expressions used to filter candidates

ndash Filtering stage does not seem to work in certain questions ndash in some ques-tions we found that one or more of the final answers are actually withinthe question under certain variations For instance for the question lsquoWhois John J Famalaro accused of having killedrsquo John Famalaro is the topanswer retrieved The solution here should be applying a set of rules toremove these variations from the final answers

ndash Answers in different formats than expected ndash some questions imply mul-tiple answers but we are only searching for a single one Here is a goodexample the correct answer for the question lsquoWhen did Innsbruck hostthe Winter Olympicsrsquo is actually two dates as it is stated in the refer-ence file but the system only goes to retrieve one of them leading to aincorrect answer Should we put the multiple answers separated as singlereferences Other questions such as lsquoWhen did the journalist CharlotteBass liversquo demand a time interval and yet a single year is retrieved Whatto do in this case

ndash Difficulty tracing certain types of answer (category confusion) ndash for thequestion lsquoWhat position does Lothar Matthaus playrsquo the answer lsquoGer-manyrsquo appears on top In another example of not finding the right typeof answer the question lsquoWhat was the first film directed by John Miliusrsquois supposed to answer lsquoDillingerrsquo but chooses lsquowesternrsquo and lsquodocumentaryrsquoas answers instead We know that the question classifier is not 100 foulproof but maybe we should try to make a list of acceptable terms forcertain headwords And in the case of films and other works perhapstrace only those between ldquordquo if possible

ndash Incomplete references ndash the most easily detectable error to be correctedis the reference file that presents us all the possible lsquocorrectrsquo answerspotentially missing viable references

After the insertion of gaps such as this while also removing a contra-ditory reference or two the results improved greatly and the system nowmanages to answer correctly 96 of the 200 questions for an accuracy of480 and a precision of 552

51

6 Conclusions and Future Work

Over the course of this thesis we have experimented new ways of improvingthe Answers Module of JustAsk our own QA system

Here are the main contributions achieved

ndash We performed a state of art on relations between answersrsquo relations andparaphrase identification to detect new possible ways to improve JustAskrsquosresults with the Answer Extraction module

ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion

ndash We developed new metrics to improve the clustering of answers

ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats

ndash We developed new corpora that we believe will be extremely useful in thefuture

ndash We performed further analysis to detect persistent error patterns

In the end results were not as good as we could hope for but we still managedto improve the systemrsquos precision as high as 91 using a modified version ofthe Multisix corpus composed of 200 questions

There is still many work to be done in the future We understand thatthe problem here does not lie exclusively with the Answer module of JustAskndash what happens before has a severe impact over this module namely duringquery formulation and passage retrieval Over the many tasks we should tryhere are a selected few

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered over Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noémie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.


[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers – Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.


[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.


[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is the hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.
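For illustration only, these predicates map almost directly onto Java string operations. The sketch below follows our own naming and assumes a plain whitespace tokenization; it is not the actual JustAsk code.

import java.util.regex.Pattern;

// Hypothetical helpers mirroring the predicates used by the rules in this appendix.
public class RulePredicates {

    // true if X contains Y
    public static boolean contains(String x, String y) {
        return x.contains(y);
    }

    // true if X contains Y, case insensitive
    public static boolean containsIC(String x, String y) {
        return x.toLowerCase().contains(y.toLowerCase());
    }

    // true if X is lexicographically equal to Y, case insensitive
    public static boolean equalsIC(String x, String y) {
        return x.equalsIgnoreCase(y);
    }

    // true if X matches the regular expression RegExp
    public static boolean matches(String x, String regExp) {
        return Pattern.compile(regExp).matcher(x).matches();
    }

    // the number of tokens of X (whitespace tokenization)
    public static int length(String x) {
        String trimmed = x.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(equalsIC("China", "china")); // true
        System.out.println(length("John F. Kennedy"));  // 3
        System.out.println(matches("2001", "\\d+"));    // true
    }
}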

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
– Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
– Example: Y2001 includes D01 M01 Y2001
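As an illustration of the two Date rules above, a minimal Java sketch could look as follows (a hypothetical helper, assuming whitespace tokenization and plain substring containment):

// Hypothetical sketch of the Date inclusion rule: Ai includes Aj when Aj contains Ai
// and Aj is one or two tokens longer (e.g. 'M01 Y2001' includes 'D01 M01 Y2001').
public class DateRules {

    private static int length(String x) {
        return x.trim().split("\\s+").length;
    }

    public static boolean includes(String ai, String aj) {
        int diff = length(aj) - length(ai);
        return aj.contains(ai) && (diff == 1 || diff == 2);
    }

    public static void main(String[] args) {
        System.out.println(includes("M01 Y2001", "D01 M01 Y2001")); // true
        System.out.println(includes("Y2001", "D01 M01 Y2001"));     // true
    }
}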


1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
– Example: 200 people is equivalent to 200

• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAimax > numAj) and (numAimin < numAj)
– Example: 210 people is equivalent to 200 people

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)")
– Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj)
– Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAjmax > numAi) and (numAjmin < numAi)
– Example: approximately 200 people includes 210


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj)
– Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj)
– Example: less than 300 includes 200
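To make the numeric tolerance concrete, here is a small illustrative sketch of the equivalence test with min = 0.9 and max = 1.1. The class name and the way the numeric value is extracted are our own simplifications, not the system's actual parser.

// Hypothetical sketch: two numeric answers are treated as equivalent when their
// values differ by less than roughly 10% (min = 0.9, max = 1.1).
public class NumericRules {

    private static final double MAX = 1.1;
    private static final double MIN = 0.9;

    // extracts the first numeric value found in the answer, or NaN if there is none
    private static double numericValue(String answer) {
        java.util.regex.Matcher m =
                java.util.regex.Pattern.compile("\\d+(\\.\\d+)?").matcher(answer.replace(",", ""));
        return m.find() ? Double.parseDouble(m.group()) : Double.NaN;
    }

    public static boolean equivalent(String ai, String aj) {
        double numAi = numericValue(ai);
        double numAj = numericValue(aj);
        if (Double.isNaN(numAi) || Double.isNaN(numAj)) {
            return false;
        }
        return numAi * MAX > numAj && numAi * MIN < numAj;
    }

    public static void main(String[] args) {
        System.out.println(equivalent("210 people", "200 people")); // true
        System.out.println(equivalent("400 people", "200 people")); // false
    }
}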

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
– Example: John F. Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
– Example: Mr. Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
– Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
– Example: John F. Kennedy is equivalent to F. Kennedy
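The first and third equivalence rules above can be sketched as follows (again an illustrative simplification under whitespace tokenization, not the actual implementation):

// Hypothetical sketch: 'John F. Kennedy' ~ 'John Kennedy' (first and last tokens agree)
// and 'John Fitzgerald Kennedy' ~ 'John F. Kennedy' (middle token abbreviated).
public class HumanRules {

    public static boolean equivalent(String ai, String aj) {
        String[] a = ai.trim().split("\\s+");
        String[] b = aj.trim().split("\\s+");
        if (a.length == 3 && b.length == 2) {
            return a[0].equalsIgnoreCase(b[0]) && a[2].equalsIgnoreCase(b[1]);
        }
        if (a.length == 3 && b.length == 3) {
            return a[0].equalsIgnoreCase(b[0]) && a[2].equalsIgnoreCase(b[2])
                    && Character.toLowerCase(a[1].charAt(0)) == Character.toLowerCase(b[1].charAt(0));
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(equivalent("John F. Kennedy", "John Kennedy"));            // true
        System.out.println(equivalent("John Fitzgerald Kennedy", "John F. Kennedy")); // true
    }
}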

1.3.1 Questions of category HumanIndividual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
– Example: Mr. Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
– Example:

• levenshteinDistance(Ai, Aj) < 1
– Example:

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
– Example: Republic of China is equivalent to China


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
– Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0")
– Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
– Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
– Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
– Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
– Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
– Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
– Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
– Example: animal includes monkey

Page 43: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

ndash lsquoDrsquo for causes of doubt between relation types

If the command given by the input fails to match any of these options thescript asks the user to repeat the judgment If the question only has a singlereference it just advances to the next question If the user running the scriptwants to skip a pair of references heshe can type lsquoskiprsquo At the end all therelations are printed to an output file too Each line has 1) the identifier of thequestion 2) the type of relationship between the two answers (INCLUSIONEQUIVALENCE COMPLEMENTARY ANSWERS CONTRADITORY AN-SWERS or IN DOUBT) and 3) the pair of answers evaluated For examplewe have

QUESTION 5 EQUIVALENCE -Lee Myung-bak AND Lee

In other words for the 5th question (lsquoWho is the president of South Ko-rearsquo) there is a relationship of equivalence between the answersreferences lsquoLeeMyung-bakrsquo and lsquoLeersquo

Figure 6 Execution of the application to create relationsrsquo corpora in the com-mand line

Over the next chapter we will try to improve the results working directlywith several possible distance measures (and combinations of two or more ofthese) we can apply during the clustering phase

43

5 Clustering Answers

As we concluded in Section 44 results were still not improving as much as onecan hope for But we still had new clustering metrics to potentially improveour system as we proposed also doing Therefore we decided to try new ex-periments by bringing new similarity metrics to the clustering stage The mainmotivation was to find a new measure (or a combination of measures) that couldbring better results than Levenshtein did for the baseline

51 Testing Different Similarity Metrics

511 Clusterer Metrics

Our practical work here began with experimenting different similarity metricsintegrating them into the distances package which contained the Overlap andLevenshtein distances For that several lexical metrics that were used for a Nat-ural Language project in 20102011 at Tagus Park were added These metricswere taken from a Java similarity package (library) ndash lsquoukacshefwitsimmetricssimilaritymetricsrsquo

ndash Jaccard Index [11] ndash the Jaccard index (also known as Jaccard similaritycoefficient) is defined by the size of the intersection divided by the size ofthe union between two sets (A and B)

J(AB) =|A capB||A cupB|

(15)

In other words the idea will be to divide the number of characters incommon by the sum of the total number of characters of the two sequences

ndash Dicersquos Coefficient [6] ndash related to Jaccard but the similarity is now com-puted using the double of the number of characters that the two sequencesshare in common With X and Y as two sets of keywords Dicersquos coefficientis given by the following formula

Dice(AB) =2|A capB||A|+ |B|

(16)

Dicersquos Coefficient will basically give the double of weight to the intersectingelements doubling the final score Consider the following two stringslsquoKennedyrsquo and lsquoKeneddyrsquo When taken as a string similarity measure thecoefficient can be calculated using the sets of bigrams for each of the twostrings [15] The set of bigrams for each word is Ke en nn ne ed dyand Ke en ne ed dd dy respectively Each set has six elements andthe intersection of these two sets gives us exactly four elements Ke enne dy The result would be (24)(6+6) = 812 = 23 If we were usingJaccard we would have 412 = 13

44

ndash Jaro [13 12] ndash this distance is mainly used in the record linkage (duplicatedetection) area It is calculated by the following

Jaro(s1 s2) =1

3(m

|s1|+

m

|s2|+mminus tm

) (17)

in which s1 and s2 are the two sequences to be tested |s1| and |s2| aretheir lengths m is the number of matching characters in s1 and s2 and tis the number of transpositions needed to put the matching characters inthe same order

ndash Jaro-Winkler [45] ndash is an extension of the Jaro metric This distance isgiven by

dw = dj + (lp(1minus dj)) (18)

with dj being the Jaro distance between the two sequences l the length ofthe common prefix and p a constant scaling factor (the standard value forthis constant in Winklerrsquos work is p = 01) Jaro-Winkler therefore putsmore weight into sharing a common prefix compared to the original Jarodistance In other words it will give a much better score to strings suchas lsquoArnoldrsquo vs lsquoArniersquo or misspellings such as lsquoMartharsquo vs lsquoMarhtarsquo thanJaro

ndash Needleman-Wunsch [30] ndash dynamic programming algorithm similar toLevenshtein distance but adding a variable cost G (instead of the constant1 for Levenshtein)

D(i j) =

D(iminus 1 j minus 1) +GD(iminus 1 j) +GD(i j minus 1) +G

(19)

This package contained other metrics of particular interest to test suchas the following two measures which were also added and tested

ndash Soundex [36] ndash With this algorithm more focused to match words whichsound alike but differ lexically we can expect to match misspelled answerswith a higher Levenshtein distance than the threshold proposed (such aslsquoMassachussetsrsquo vs lsquoMachassucetsrsquo) or even match answers in differentlanguages that happen to sound alike but spelled differently

ndash Smith-Waterman [40] ndash is a variation of Needleman-Wunsch commonlyused in bioinformatics to align protein or nucleotide sequences It com-pares segments of all possible lengths (local alignments) instead of lookingat the sequence in its entirety and chooses the one that maximises thesimilarity measure leading us to believe that it might work better withanswers with common affixes such as lsquoactingrsquo and lsquoenactrsquo

45

Then the LogSim metric by Cordeiro and his team described over theRelated Work in Section 322 was implemented and added to the packageusing the same weighting penalty of 27 as in the teamrsquos experiments We thentested all the similarity measures using three different thresholds values 12(05) 13(033) and 16(017) of the single-link clustering algorithm in orderto evaluate how much does it affect the performance of each metric

512 Results

Results are described in Table 8 We decided to evaluate these metrics usingthe corpus of 200 questions for a simple reason it is by far the corpus thatprovides us the best results of the system at this point and we should expectproporcional results between these two corpus either way

513 Results Analysis

As we can see the metric that achieves the best results is still Levenshteindistance with the LogSim distance at a close second place overall ndash even holdingthe distintion of being the best measure for the highest threshold evaluated (05)Defining the lsquobestrsquo threshold was not as clear as picking the best metric giventhese results The threshold of 017 provides the best results for more than halfof the metrics here but there are cases when it is only the second best even ifthe difference itself is virtually inexistent in most cases

46

MetricThreshold 017 033 05Levenshtein P511

R870A445

P437R875A385

P373R845A315

Overlap P379R870A330

P330R870A330

P364R865A315

Jaccard index P98R560A55

P99R555A55

P99R555A55

Dice Coefficient P98R560A55

P99R560A55

P99R555A55

Jaro P110R635A70

P119R630A75

P98R560A55

Jaro-Winkler P110R635A70

P119R630A75

P98R560A55

Needleman-Wunch P437R870A380

P437R870A380

P169R590A100

LogSim P437R870A380

P437R870A380

A420R870A365

Soundex P359R835A300

P359R835A300

P287R820A235

Smith-Waterman P173R635A110

P134R595A80

P104R575A60

Table 8 Precision (P) Recall (R) and Accuracy (A) results for every similaritymetric added for a given threshold value using the test corpus of 200 questionsand 32 passages from Google Best precision and accuracy of the system bolded

52 New Distance Metric - rsquoJFKDistancersquo

521 Description

Considering the two candidate answers lsquoJF Kennedyrsquo and lsquoJohn FitzgeraldKennedyrsquo Levenshtein will give us a very high distance and therefore will notgroup these two answers into a single cluster The goal here is to finally beable to relate answers such as these A new measure was therefore created in

47

order to improve the overall accuracy of JustAsk This measure is in a waya progression from metrics such as Overlap Distance and LogSim also using atokenizer to break the strings into a set of tokens and then taking the basicconcept of common terms (we can also call it links words or tokens) betweenthe two strings and also adding an extra weight to strings that not only shareat least a perfect match but have also another token that shares the first letterwith the other token from the other answer To exemplify consider the exampleabove We have a perfect match in the common term Kennedy and then wehave two tokens unmatched Even the apparently intuitive Overlap distancewill give us a value of 23 making it impossible to cluster these two answersLevenshtein will be closer to 05 Still underwhelming Since we already haveone word in common with the new metric we can assume with some securitythat there is a relation between lsquoJohnrsquo and lsquoJrsquo and lsquoFitzgeraldrsquo and lsquoFrsquo Thisweight will obviously not be as big as the score we give to a lsquoperfectrsquo match butit should help us clustering more of these types of answers For that matter wedivide by 2 what would be a value identical to a match between two tokens forthat extra weight

The new distance is computed the following way

distance = 1minus commonTermsScoreminus firstLetterMatchesScore (20)

in which the scores for the common terms and matches of the first letter ofeach token are given by

commonTermsScore =commonTerms

max(length(a1) length(a2)) (21)

and

firstLetterMatchesScore =firstLetterMatchesnonCommonTerms

2(22)

with a1 and a2 being the two answers converted to strings after normaliza-tion and nonCommonTerms being the terms that remain unmatched such aslsquoJrsquo lsquoJohnrsquo lsquoFrsquo andlsquoFitzgeraldrsquo in the above example

A few flaws about this measure were thought later on namely the scopeof this measure whithin our topology of answers if for example we applythis same measure to answers of type Numeric results could become quiteunpredictable and wrong It became quite clear that a separation of measuresgiven a type of answer (Numeric Entity Location etc) was needed if wewanted to use this measure to its full potential as we will see

48

522 Results

Table 9 shows us the results after the implementation of this new distancemetric

200 questions 1210 questionsBefore P511 R870

A445P282 R798A225

After new metric P517 R870A450

P292 R798A233

Table 9 Precision (P) Recall (R) and Accuracy (A) Results after the imple-mentation of the new distance metric Using 30 passages from New York Timesfor the corpus of 1210 questions from TREC

523 Results Analysis

As we can see for these results even with the redundancy of sharing featuresthat we also have in the rules and in certain normalizers this metric can actu-ally improve our results even if the difference is minimal and not statisticallysignificant

53 Possible new features ndash Combination of distance met-rics to enhance overall results

We also thought in analyzing these results by the question category using anevaluation framework developed earlier The motivation behind this evaluationwas to build a decision tree of sorts that would basically choose a metric (or aconjunction of metrics) for a certain category (or a group of categories) If forinstance we detect that the Levenshtein distance gives us the best answers forthe category Human but for the category Location LogSim was the metricwhich offered us the best results we would have something like

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

()

For that effort we used the best thresholds for each metric using the corpusof 200 questions and Google search engine retrieving 32 passages Our questionclassifier achieves an accuracy of 90 [39] and there have been a few answersnot labeled as correct by the framework that are indeed correct With that inmind and considering that these are issues that are common to every singledistance metric used (so they should not affect that much the analysis we are

49

looking for) here are the results for every metric incorporated by the threemain categories Human Location and Numeric that can be easily evaluated(categories such as Description by retrieving the Wikipedia abstract are justmarked as wrongnil given the reference input) shown in Table 10

MetricCategory Human Location NumericLevenshtein 5000 5385 484Overlap 4242 4103 484Jaccard index 758 000 000Dice 758 000 000Jaro 909 000 000Jaro-Winkler 909 000 000Needleman-Wunch 4242 4615 484LogSim 4242 4615 484Soundex 4242 4615 323Smith-Waterman 909 256 000JFKDistance 5455 5385 484

Table 10 Accuracy results for every similarity metric added for each categoryusing the test corpus of 200 questions and 32 passages from Google Bestprecision and accuracy of the system bolded except for when all the metricsgive the same result

We can easily notice that all of these metrics behave in similar ways througheach category ie results are always lsquoproporcionalrsquo in each category to howthey perform overall That stops us from picking best metrics for each categorysince we find out that the new distance implemented equals or surpasses (aswith the case of Human category) all the others

54 Posterior analysis

In another effort to improve the results above a deeper analysis was madethrough the systemrsquos output regarding the corpus test of 200 questions in orderto detect certain lsquoerror patternsrsquo that could be resolved in the future Below arethe main points of improvement

ndash Time-changing answers ndash Some questions do not have a straight answerone that never changes with time A clear example is a question suchas lsquoWho is the prime-minister of Japanrsquo We have observed that formerprime-minister Yukio Hatoyama actually earns a higher score than cur-rent prime-minister Naoto Kan which should be the correct answer Thesolution Give an extra weight to passages that are at least from the sameyear of the collected corpus from Google (in this case 2010)

ndash Numeric answers are easily failable due to the fact that when we expecta number it can pick up words such as lsquoonersquo actually retrieving that as

50

a top answer in a couple of cases Moreover when it expects an answerof type NumericDate it could still easely choose the dateyear whenthe article was posted (or even the number of views of that article) Thesolution might pass for a) ignoring lsquoonersquo to lsquothreersquo when written in wordsb) exclude the publishing date from the snippet somehow and c) improvethe regular expressions used to filter candidates

ndash Filtering stage does not seem to work in certain questions ndash in some ques-tions we found that one or more of the final answers are actually withinthe question under certain variations For instance for the question lsquoWhois John J Famalaro accused of having killedrsquo John Famalaro is the topanswer retrieved The solution here should be applying a set of rules toremove these variations from the final answers

ndash Answers in different formats than expected ndash some questions imply mul-tiple answers but we are only searching for a single one Here is a goodexample the correct answer for the question lsquoWhen did Innsbruck hostthe Winter Olympicsrsquo is actually two dates as it is stated in the refer-ence file but the system only goes to retrieve one of them leading to aincorrect answer Should we put the multiple answers separated as singlereferences Other questions such as lsquoWhen did the journalist CharlotteBass liversquo demand a time interval and yet a single year is retrieved Whatto do in this case

ndash Difficulty tracing certain types of answer (category confusion) ndash for thequestion lsquoWhat position does Lothar Matthaus playrsquo the answer lsquoGer-manyrsquo appears on top In another example of not finding the right typeof answer the question lsquoWhat was the first film directed by John Miliusrsquois supposed to answer lsquoDillingerrsquo but chooses lsquowesternrsquo and lsquodocumentaryrsquoas answers instead We know that the question classifier is not 100 foulproof but maybe we should try to make a list of acceptable terms forcertain headwords And in the case of films and other works perhapstrace only those between ldquordquo if possible

ndash Incomplete references ndash the most easily detectable error to be correctedis the reference file that presents us all the possible lsquocorrectrsquo answerspotentially missing viable references

After the insertion of gaps such as this while also removing a contra-ditory reference or two the results improved greatly and the system nowmanages to answer correctly 96 of the 200 questions for an accuracy of480 and a precision of 552

51

6 Conclusions and Future Work

Over the course of this thesis we have experimented new ways of improvingthe Answers Module of JustAsk our own QA system

Here are the main contributions achieved

ndash We performed a state of art on relations between answersrsquo relations andparaphrase identification to detect new possible ways to improve JustAskrsquosresults with the Answer Extraction module

ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion

ndash We developed new metrics to improve the clustering of answers

ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats

ndash We developed new corpora that we believe will be extremely useful in thefuture

ndash We performed further analysis to detect persistent error patterns

In the end results were not as good as we could hope for but we still managedto improve the systemrsquos precision as high as 91 using a modified version ofthe Multisix corpus composed of 200 questions

There is still many work to be done in the future We understand thatthe problem here does not lie exclusively with the Answer module of JustAskndash what happens before has a severe impact over this module namely duringquery formulation and passage retrieval Over the many tasks we should tryhere are a selected few

ndash Develop a more semantic clusterer instead of merely lsquoortographicalrsquo byadding more rules that deal with the semantic of the answers

ndash Use the rules as a distance metric of its own

ndash Finally implementing solutions to the problems and questions unansweredover Section 54 can help improving somewhat these results

52

References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006

[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989

53

[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006

[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini Simone Romagnoli Alessandro Vallin Jesus HerreraAnselmo Penas Vıctor Peinado Felisa Verdejo and Maarten de Rijke Themultiple language question answering track at clef 2003 In CLEF volume3237 of Lecture Notes in Computer Science pages 471ndash486 Springer 2003

[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor ErbachAnselmo Pe nas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov andRichard F E Sutcliffe Overview of the clef 2004 multilingual questionanswering track In CLEF volume 3491 of Lecture Notes in ComputerScience pages 371ndash391 Springer 2004

54

[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence

[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006

[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995

[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006

[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970

[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal

[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009

[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools

55

[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005

56

Appendix

Appendix A ndash Rules for detecting relations between rules

57

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2 and considering

bull Aij ndash the jth token of candidate answer Ai

bull Aijk ndash the kth char of the jth token of candidate answer Ai

bull lemma(X) ndash the lemma of X

bull ishypernym(XY ) ndash returns true if X is the hypernym of Y

bull length(X) ndash the number of tokens of X

bull contains(X Y ) ndash returns true if Ai contains Aj

bull containsic(X Y ) ndash returns true if X contains Y case insensitive

bull equals(X Y ) ndash returns true if X is lexicographically equal to Y

bull equalsic(X Y ) ndash returns true if X is lexicographically equal to Y caseinsensitive

bull matches(XRegExp) ndash returns true if X matches the regular expressionRegExp

11 Questions of category Date

Ai includes Aj if

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 1)ndash Example M01 Y2001 includes D01 M01 Y2001

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 2)ndash Example Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

12 Questions of category Numeric

Given

bull number = (d+([0-9E]))

bull lessthan = ((less than)|(almost)|(up to)|(nearly))

bull morethan = ((more than)|(over)|(at least))

bull approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

bull max = 11 (or another value)

bull min = 09 (or another value)

bull numX ndash the numeric value in X

bull numXmax = numX max

bull numXmin = numX min

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj number) andmatches(Ai lsquolsquoAj( [a-z]+)rsquorsquo)ndash Example 200 people is equivalent to 200

bull matchesic(Ai lsquolsquo()bnumber( )rsquorsquo) andmatchesic(Aj lsquolsquo()bnumber( )rsquorsquo) and(numAimax gt numAj) and (numAimin lt numAj)ndash Example 210 people is equivalent to 200 people

Ai includes Aj if

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo)ndash Example approximately 200 people includes 210

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAi = numAj)ndash Example approximately 200 people includes 200

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAjmax gt numAi) and (numAjmin lt numAi)ndash Example approximately 200 people includes 210

1 Rules for detecting relations between answers 3

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bmorethan number( )rsquorsquo) and(numAi lt numAj)ndash Example more than 200 includes 300

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()blessthan number( )rsquorsquo) and(numAi gt numAj)ndash Example less than 300 includes 200

13 Questions of category Human

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai0Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to John Kennedy

bull (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai1Aj1) andmatches(Ai0lsquolsquo(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)rsquorsquo)ndash Example Mr Kennedy is equivalent to John Kennedy

bull (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai0Aj0) and equalsic(Ai2Aj2)and equalsic(Ai10Aj10)ndash Example John Fitzgerald Kennedy is equivalent to John F Kennedy

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai1Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to F Kennedy

131 Questions of category HumanIndividual

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai1Aj0)ndash Example Mr Kennedy is equivalent to Kennedy

bull containsic(AiAj) ndash Example

bull levenshteinDistance(AiAj) lt 1 ndash Example

14 Questions of category Location

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai0lsquolsquo(republic|city|kingdom|island)rsquorsquo) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example Republic of China is equivalent to China

1 Rules for detecting relations between answers 4

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ailsquolsquosins+Aj0rsquorsquo)ndash Example in China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) andmatches(Ailsquolsquo(republic|city|kingdom|island)s+(in|of)s+Aj0rsquorsquo)ndash Example in the lovely Republic of China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) and matches(Ailsquolsquo(in|of)s+Aj0rsquorsquo)ndash Example

Ai includes Aj if

bull (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai0Aj2) andequalsic(Aj1ldquordquo) andndash Example Florida includes Miami Florida

bull (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai0Aj3) andequalsic(Aj2ldquordquo) andndash Example California includes Los Angeles California

bull (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai0Aj1) andmatchesic(Aj0lsquolsquo(south|west|north|east|central)[a-z]rsquorsquo) andndash Example California includes East California

15 Questions of category Entity

Ai is equivalent to Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and equalsic(lemma(Ai)lemma(Aj))ndash Example

Ai includes Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and ishypernym(AiAj)ndash Example animal includes monkey

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 44: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

5 Clustering Answers

As we concluded in Section 44 results were still not improving as much as onecan hope for But we still had new clustering metrics to potentially improveour system as we proposed also doing Therefore we decided to try new ex-periments by bringing new similarity metrics to the clustering stage The mainmotivation was to find a new measure (or a combination of measures) that couldbring better results than Levenshtein did for the baseline

51 Testing Different Similarity Metrics

511 Clusterer Metrics

Our practical work here began with experimenting different similarity metricsintegrating them into the distances package which contained the Overlap andLevenshtein distances For that several lexical metrics that were used for a Nat-ural Language project in 20102011 at Tagus Park were added These metricswere taken from a Java similarity package (library) ndash lsquoukacshefwitsimmetricssimilaritymetricsrsquo

ndash Jaccard Index [11] ndash the Jaccard index (also known as Jaccard similaritycoefficient) is defined by the size of the intersection divided by the size ofthe union between two sets (A and B)

J(AB) =|A capB||A cupB|

(15)

In other words the idea will be to divide the number of characters incommon by the sum of the total number of characters of the two sequences

ndash Dicersquos Coefficient [6] ndash related to Jaccard but the similarity is now com-puted using the double of the number of characters that the two sequencesshare in common With X and Y as two sets of keywords Dicersquos coefficientis given by the following formula

Dice(AB) =2|A capB||A|+ |B|

(16)

Dicersquos Coefficient will basically give the double of weight to the intersectingelements doubling the final score Consider the following two stringslsquoKennedyrsquo and lsquoKeneddyrsquo When taken as a string similarity measure thecoefficient can be calculated using the sets of bigrams for each of the twostrings [15] The set of bigrams for each word is Ke en nn ne ed dyand Ke en ne ed dd dy respectively Each set has six elements andthe intersection of these two sets gives us exactly four elements Ke enne dy The result would be (24)(6+6) = 812 = 23 If we were usingJaccard we would have 412 = 13

44

ndash Jaro [13 12] ndash this distance is mainly used in the record linkage (duplicatedetection) area It is calculated by the following

Jaro(s1 s2) =1

3(m

|s1|+

m

|s2|+mminus tm

) (17)

in which s1 and s2 are the two sequences to be tested |s1| and |s2| aretheir lengths m is the number of matching characters in s1 and s2 and tis the number of transpositions needed to put the matching characters inthe same order

ndash Jaro-Winkler [45] ndash is an extension of the Jaro metric This distance isgiven by

dw = dj + (lp(1minus dj)) (18)

with dj being the Jaro distance between the two sequences l the length ofthe common prefix and p a constant scaling factor (the standard value forthis constant in Winklerrsquos work is p = 01) Jaro-Winkler therefore putsmore weight into sharing a common prefix compared to the original Jarodistance In other words it will give a much better score to strings suchas lsquoArnoldrsquo vs lsquoArniersquo or misspellings such as lsquoMartharsquo vs lsquoMarhtarsquo thanJaro

ndash Needleman-Wunsch [30] ndash dynamic programming algorithm similar toLevenshtein distance but adding a variable cost G (instead of the constant1 for Levenshtein)

D(i j) =

D(iminus 1 j minus 1) +GD(iminus 1 j) +GD(i j minus 1) +G

(19)

This package contained other metrics of particular interest to test suchas the following two measures which were also added and tested

ndash Soundex [36] ndash With this algorithm more focused to match words whichsound alike but differ lexically we can expect to match misspelled answerswith a higher Levenshtein distance than the threshold proposed (such aslsquoMassachussetsrsquo vs lsquoMachassucetsrsquo) or even match answers in differentlanguages that happen to sound alike but spelled differently

ndash Smith-Waterman [40] ndash is a variation of Needleman-Wunsch commonlyused in bioinformatics to align protein or nucleotide sequences It com-pares segments of all possible lengths (local alignments) instead of lookingat the sequence in its entirety and chooses the one that maximises thesimilarity measure leading us to believe that it might work better withanswers with common affixes such as lsquoactingrsquo and lsquoenactrsquo

45

Then the LogSim metric by Cordeiro and his team described over theRelated Work in Section 322 was implemented and added to the packageusing the same weighting penalty of 27 as in the teamrsquos experiments We thentested all the similarity measures using three different thresholds values 12(05) 13(033) and 16(017) of the single-link clustering algorithm in orderto evaluate how much does it affect the performance of each metric

512 Results

Results are described in Table 8 We decided to evaluate these metrics usingthe corpus of 200 questions for a simple reason it is by far the corpus thatprovides us the best results of the system at this point and we should expectproporcional results between these two corpus either way

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall – LogSim even holds the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference itself is virtually nonexistent in most cases.


Metric / Threshold     0.17                      0.33                      0.5
Levenshtein            P 51.1  R 87.0  A 44.5    P 43.7  R 87.5  A 38.5    P 37.3  R 84.5  A 31.5
Overlap                P 37.9  R 87.0  A 33.0    P 33.0  R 87.0  A 33.0    P 36.4  R 86.5  A 31.5
Jaccard index          P 9.8   R 56.0  A 5.5     P 9.9   R 55.5  A 5.5     P 9.9   R 55.5  A 5.5
Dice Coefficient       P 9.8   R 56.0  A 5.5     P 9.9   R 56.0  A 5.5     P 9.9   R 55.5  A 5.5
Jaro                   P 11.0  R 63.5  A 7.0     P 11.9  R 63.0  A 7.5     P 9.8   R 56.0  A 5.5
Jaro-Winkler           P 11.0  R 63.5  A 7.0     P 11.9  R 63.0  A 7.5     P 9.8   R 56.0  A 5.5
Needleman-Wunsch       P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 16.9  R 59.0  A 10.0
LogSim                 P 43.7  R 87.0  A 38.0    P 43.7  R 87.0  A 38.0    P 42.0  R 87.0  A 36.5
Soundex                P 35.9  R 83.5  A 30.0    P 35.9  R 83.5  A 30.0    P 28.7  R 82.0  A 23.5
Smith-Waterman         P 17.3  R 63.5  A 11.0    P 13.4  R 59.5  A 8.0     P 10.4  R 57.5  A 6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for each threshold value, using the test corpus of 200 questions and 32 passages from Google. The best precision and accuracy of the system are obtained by the Levenshtein distance at threshold 0.17.

5.2 New Distance Metric – 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and will therefore not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these, so a new measure was created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as the Overlap distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect token match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term 'Kennedy', and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers, and Levenshtein will be closer to 0.5 – still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a perfect match, but it should help us cluster more answers of this type. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, to obtain that extra weight.

The new distance is computed the following way:

distance = 1 - commonTermsScore - firstLetterMatchesScore   (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by

commonTermsScore = commonTerms / max(length(a1), length(a2))   (21)

and

firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2   (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
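A minimal sketch of the metric as defined by Equations 20-22, assuming the normalized answers are whitespace-tokenized and that firstLetterMatches counts one match per pair of unmatched tokens sharing an initial (our reading of the definition):

def jfk_distance(a1, a2):
    t1, t2 = a1.split(), a2.split()
    common = set(t1) & set(t2)
    common_terms_score = len(common) / max(len(t1), len(t2))      # Equation 21
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    non_common_terms = len(rest1) + len(rest2)
    # pair up unmatched tokens sharing a first letter ('J.'/'John', 'F.'/'Fitzgerald')
    used = [False] * len(rest2)
    first_letter_matches = 0
    for t in rest1:
        for k, u in enumerate(rest2):
            if not used[k] and t[0].lower() == u[0].lower():
                used[k] = True
                first_letter_matches += 1
                break
    if non_common_terms:
        first_letter_score = (first_letter_matches / non_common_terms) / 2  # Equation 22
    else:
        first_letter_score = 0.0
    return 1 - common_terms_score - first_letter_score            # Equation 20

# jfk_distance('J. F. Kennedy', 'John Fitzgerald Kennedy') is roughly 0.42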

A few flaws in this measure became apparent later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became clear that separate measures per answer type (Numeric, Entity, Location, etc.) were needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                       200 questions             1210 questions
Before                 P 51.1  R 87.0  A 44.5    P 28.2  R 79.8  A 22.5
After new metric       P 51.7  R 87.0  A 45.0    P 29.2  R 79.8  A 23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of features that are also present in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also considered analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim is the metric which offers the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)
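A sketch of how such a selection could be wired, with a purely illustrative category-to-metric table (as Table 10 below shows, this ended up being unnecessary):

def pick_metric(category, metrics, default="JFKDistance"):
    # 'metrics' maps metric names to distance functions;
    # the table below is illustrative, not a tuned configuration of JustAsk
    by_category = {
        "Human": "Levenshtein",
        "Location": "LogSim",
    }
    return metrics[by_category.get(category, default)]

# usage: metric = pick_metric(question_category,
#                             {"Levenshtein": ..., "LogSim": ..., "JFKDistance": ...})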

For that effort we used the best thresholds for each metric, with the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are in fact correct. With that in mind, and considering that these issues are common to every single distance metric used (so they should not greatly affect the analysis we are looking for), the results for every metric incorporated, over the three main categories that can be easily evaluated – Human, Location and Numeric (categories such as Description, answered by retrieving the Wikipedia abstract, are simply marked as wrong/nil given the reference input) – are shown in Table 10.

Metric / Category      Human    Location   Numeric
Levenshtein            50.00    53.85      4.84
Overlap                42.42    41.03      4.84
Jaccard index          7.58     0.00       0.00
Dice                   7.58     0.00       0.00
Jaro                   9.09     0.00       0.00
Jaro-Winkler           9.09     0.00       0.00
Needleman-Wunsch       42.42    46.15      4.84
LogSim                 42.42    46.15      4.84
Soundex                42.42    46.15      3.23
Smith-Waterman         9.09     2.56       0.00
JFKDistance            54.55    53.85      4.84

Table 10: Accuracy results (%) for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. The best accuracy in each category is obtained by JFKDistance (tied with Levenshtein for Location, and with several metrics for Numeric).

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metrics perform overall. That stops us from picking the best metric for each category, since the new distance implemented equals or surpasses all the others (as in the case of the Human category).

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement.

– Time-changing answers – Some questions do not have a single fixed answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than then-current prime-minister Naoto Kan, which should be the correct answer. The solution: give extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).

– Numeric answers are easily failable, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might involve: a) ignoring 'one' to 'three' when written as words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates.

– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be to apply a set of rules to remove these variations from the final answers.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we split the multiple answers into separate single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but 'western' and 'documentary' are chosen as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords; and, in the case of films and other works, perhaps trace only candidates between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After the insertion of missing references such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved

– We surveyed the state of the art on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Among the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographic' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006

[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989


[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006

[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471-486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov, and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371-391. Springer, 2004.


[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence

[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006

[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995

[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006

[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970

[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal

[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009

[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools


[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005


Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the j-th token of candidate answer Ai

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai

• lemma(X) – the lemma of X

• ishypernym(X, Y) – returns true if X is the hypernym of Y

• length(X) – the number of tokens of X

• contains(X, Y) – returns true if X contains Y

• containsic(X, Y) – returns true if X contains Y, case insensitive

• equals(X, Y) – returns true if X is lexicographically equal to Y

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) – returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  Example: 'M01 Y2001' includes 'D01 M01 Y2001'

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  Example: 'Y2001' includes 'D01 M01 Y2001'
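As a sketch, these two rules can be read as follows, taking contains() as substring containment over the answer strings and length() as the whitespace token count:

def date_includes(ai, aj):
    # Ai includes Aj when the more specific date Aj contains the more general
    # date Ai and has one or two extra tokens
    extra_tokens = len(aj.split()) - len(ai.split())
    return ai in aj and extra_tokens in (1, 2)

# date_includes('M01 Y2001', 'D01 M01 Y2001')  -> True (one extra token)
# date_includes('Y2001', 'D01 M01 Y2001')      -> True (two extra tokens)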



1.2 Questions of category Numeric

Given

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX × max

• numXmin = numX × min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  Example: '200 people' is equivalent to '200'

• matchesic(Ai, "()\bnumber( )") and matchesic(Aj, "()\bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
  Example: '210 people' is equivalent to '200 people'

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )")
  Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAi = numAj)
  Example: 'approximately 200 people' includes '200'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
  Example: 'approximately 200 people' includes '210'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\bmorethan number( )") and (numAi < numAj)
  Example: 'more than 200' includes '300'

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()\blessthan number( )") and (numAi > numAj)
  Example: 'less than 300' includes '200'
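A sketch of the second equivalence rule (numeric values within the min/max band), with a simplified pattern standing in for the number expression above:

import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")  # simplified stand-in for 'number'

def numeric_equivalent(ai, aj, min_factor=0.9, max_factor=1.1):
    # Ai is equivalent to Aj when numAj lies strictly inside
    # (numAi * min, numAi * max), e.g. '210 people' vs '200 people'
    mi, mj = NUMBER.search(ai), NUMBER.search(aj)
    if mi is None or mj is None:
        return False
    num_ai, num_aj = float(mi.group()), float(mj.group())
    return num_ai * min_factor < num_aj < num_ai * max_factor

# numeric_equivalent('210 people', '200 people')  -> True
# numeric_equivalent('300 people', '200 people')  -> False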

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  Example: 'John F. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  Example: 'Mr. Kennedy' is equivalent to 'John Kennedy'

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  Example: 'John Fitzgerald Kennedy' is equivalent to 'John F. Kennedy'

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  Example: 'John F. Kennedy' is equivalent to 'F. Kennedy'

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  Example: 'Mr. Kennedy' is equivalent to 'Kennedy'

• containsic(Ai, Aj)
  Example:

• levenshteinDistance(Ai, Aj) < 1
  Example:

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  Example: 'Republic of China' is equivalent to 'China'

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0")
  Example: 'in China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  Example: 'in the lovely Republic of China' is equivalent to 'China'

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  Example: 'Florida' includes 'Miami , Florida'

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  Example: 'California' includes 'Los Angeles , California'

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  Example: 'California' includes 'East California'

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  Example: 'animal' includes 'monkey'

Page 45: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

ndash Jaro [13 12] ndash this distance is mainly used in the record linkage (duplicatedetection) area It is calculated by the following

Jaro(s1 s2) =1

3(m

|s1|+

m

|s2|+mminus tm

) (17)

in which s1 and s2 are the two sequences to be tested |s1| and |s2| aretheir lengths m is the number of matching characters in s1 and s2 and tis the number of transpositions needed to put the matching characters inthe same order

ndash Jaro-Winkler [45] ndash is an extension of the Jaro metric This distance isgiven by

dw = dj + (lp(1minus dj)) (18)

with dj being the Jaro distance between the two sequences l the length ofthe common prefix and p a constant scaling factor (the standard value forthis constant in Winklerrsquos work is p = 01) Jaro-Winkler therefore putsmore weight into sharing a common prefix compared to the original Jarodistance In other words it will give a much better score to strings suchas lsquoArnoldrsquo vs lsquoArniersquo or misspellings such as lsquoMartharsquo vs lsquoMarhtarsquo thanJaro

ndash Needleman-Wunsch [30] ndash dynamic programming algorithm similar toLevenshtein distance but adding a variable cost G (instead of the constant1 for Levenshtein)

D(i j) =

D(iminus 1 j minus 1) +GD(iminus 1 j) +GD(i j minus 1) +G

(19)

This package contained other metrics of particular interest to test suchas the following two measures which were also added and tested

ndash Soundex [36] ndash With this algorithm more focused to match words whichsound alike but differ lexically we can expect to match misspelled answerswith a higher Levenshtein distance than the threshold proposed (such aslsquoMassachussetsrsquo vs lsquoMachassucetsrsquo) or even match answers in differentlanguages that happen to sound alike but spelled differently

ndash Smith-Waterman [40] ndash is a variation of Needleman-Wunsch commonlyused in bioinformatics to align protein or nucleotide sequences It com-pares segments of all possible lengths (local alignments) instead of lookingat the sequence in its entirety and chooses the one that maximises thesimilarity measure leading us to believe that it might work better withanswers with common affixes such as lsquoactingrsquo and lsquoenactrsquo

45

Then the LogSim metric by Cordeiro and his team described over theRelated Work in Section 322 was implemented and added to the packageusing the same weighting penalty of 27 as in the teamrsquos experiments We thentested all the similarity measures using three different thresholds values 12(05) 13(033) and 16(017) of the single-link clustering algorithm in orderto evaluate how much does it affect the performance of each metric

512 Results

Results are described in Table 8 We decided to evaluate these metrics usingthe corpus of 200 questions for a simple reason it is by far the corpus thatprovides us the best results of the system at this point and we should expectproporcional results between these two corpus either way

513 Results Analysis

As we can see the metric that achieves the best results is still Levenshteindistance with the LogSim distance at a close second place overall ndash even holdingthe distintion of being the best measure for the highest threshold evaluated (05)Defining the lsquobestrsquo threshold was not as clear as picking the best metric giventhese results The threshold of 017 provides the best results for more than halfof the metrics here but there are cases when it is only the second best even ifthe difference itself is virtually inexistent in most cases

46

MetricThreshold 017 033 05Levenshtein P511

R870A445

P437R875A385

P373R845A315

Overlap P379R870A330

P330R870A330

P364R865A315

Jaccard index P98R560A55

P99R555A55

P99R555A55

Dice Coefficient P98R560A55

P99R560A55

P99R555A55

Jaro P110R635A70

P119R630A75

P98R560A55

Jaro-Winkler P110R635A70

P119R630A75

P98R560A55

Needleman-Wunch P437R870A380

P437R870A380

P169R590A100

LogSim P437R870A380

P437R870A380

A420R870A365

Soundex P359R835A300

P359R835A300

P287R820A235

Smith-Waterman P173R635A110

P134R595A80

P104R575A60

Table 8 Precision (P) Recall (R) and Accuracy (A) results for every similaritymetric added for a given threshold value using the test corpus of 200 questionsand 32 passages from Google Best precision and accuracy of the system bolded

52 New Distance Metric - rsquoJFKDistancersquo

521 Description

Considering the two candidate answers lsquoJF Kennedyrsquo and lsquoJohn FitzgeraldKennedyrsquo Levenshtein will give us a very high distance and therefore will notgroup these two answers into a single cluster The goal here is to finally beable to relate answers such as these A new measure was therefore created in

47

order to improve the overall accuracy of JustAsk This measure is in a waya progression from metrics such as Overlap Distance and LogSim also using atokenizer to break the strings into a set of tokens and then taking the basicconcept of common terms (we can also call it links words or tokens) betweenthe two strings and also adding an extra weight to strings that not only shareat least a perfect match but have also another token that shares the first letterwith the other token from the other answer To exemplify consider the exampleabove We have a perfect match in the common term Kennedy and then wehave two tokens unmatched Even the apparently intuitive Overlap distancewill give us a value of 23 making it impossible to cluster these two answersLevenshtein will be closer to 05 Still underwhelming Since we already haveone word in common with the new metric we can assume with some securitythat there is a relation between lsquoJohnrsquo and lsquoJrsquo and lsquoFitzgeraldrsquo and lsquoFrsquo Thisweight will obviously not be as big as the score we give to a lsquoperfectrsquo match butit should help us clustering more of these types of answers For that matter wedivide by 2 what would be a value identical to a match between two tokens forthat extra weight

The new distance is computed the following way

distance = 1minus commonTermsScoreminus firstLetterMatchesScore (20)

in which the scores for the common terms and matches of the first letter ofeach token are given by

commonTermsScore =commonTerms

max(length(a1) length(a2)) (21)

and

firstLetterMatchesScore =firstLetterMatchesnonCommonTerms

2(22)

with a1 and a2 being the two answers converted to strings after normaliza-tion and nonCommonTerms being the terms that remain unmatched such aslsquoJrsquo lsquoJohnrsquo lsquoFrsquo andlsquoFitzgeraldrsquo in the above example

A few flaws about this measure were thought later on namely the scopeof this measure whithin our topology of answers if for example we applythis same measure to answers of type Numeric results could become quiteunpredictable and wrong It became quite clear that a separation of measuresgiven a type of answer (Numeric Entity Location etc) was needed if wewanted to use this measure to its full potential as we will see

48

522 Results

Table 9 shows us the results after the implementation of this new distancemetric

200 questions 1210 questionsBefore P511 R870

A445P282 R798A225

After new metric P517 R870A450

P292 R798A233

Table 9 Precision (P) Recall (R) and Accuracy (A) Results after the imple-mentation of the new distance metric Using 30 passages from New York Timesfor the corpus of 1210 questions from TREC

523 Results Analysis

As we can see for these results even with the redundancy of sharing featuresthat we also have in the rules and in certain normalizers this metric can actu-ally improve our results even if the difference is minimal and not statisticallysignificant

53 Possible new features ndash Combination of distance met-rics to enhance overall results

We also thought in analyzing these results by the question category using anevaluation framework developed earlier The motivation behind this evaluationwas to build a decision tree of sorts that would basically choose a metric (or aconjunction of metrics) for a certain category (or a group of categories) If forinstance we detect that the Levenshtein distance gives us the best answers forthe category Human but for the category Location LogSim was the metricwhich offered us the best results we would have something like

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

()

For that effort we used the best thresholds for each metric using the corpusof 200 questions and Google search engine retrieving 32 passages Our questionclassifier achieves an accuracy of 90 [39] and there have been a few answersnot labeled as correct by the framework that are indeed correct With that inmind and considering that these are issues that are common to every singledistance metric used (so they should not affect that much the analysis we are

49

looking for) here are the results for every metric incorporated by the threemain categories Human Location and Numeric that can be easily evaluated(categories such as Description by retrieving the Wikipedia abstract are justmarked as wrongnil given the reference input) shown in Table 10

MetricCategory Human Location NumericLevenshtein 5000 5385 484Overlap 4242 4103 484Jaccard index 758 000 000Dice 758 000 000Jaro 909 000 000Jaro-Winkler 909 000 000Needleman-Wunch 4242 4615 484LogSim 4242 4615 484Soundex 4242 4615 323Smith-Waterman 909 256 000JFKDistance 5455 5385 484

Table 10 Accuracy results for every similarity metric added for each categoryusing the test corpus of 200 questions and 32 passages from Google Bestprecision and accuracy of the system bolded except for when all the metricsgive the same result

We can easily notice that all of these metrics behave in similar ways througheach category ie results are always lsquoproporcionalrsquo in each category to howthey perform overall That stops us from picking best metrics for each categorysince we find out that the new distance implemented equals or surpasses (aswith the case of Human category) all the others

54 Posterior analysis

In another effort to improve the results above a deeper analysis was madethrough the systemrsquos output regarding the corpus test of 200 questions in orderto detect certain lsquoerror patternsrsquo that could be resolved in the future Below arethe main points of improvement

ndash Time-changing answers ndash Some questions do not have a straight answerone that never changes with time A clear example is a question suchas lsquoWho is the prime-minister of Japanrsquo We have observed that formerprime-minister Yukio Hatoyama actually earns a higher score than cur-rent prime-minister Naoto Kan which should be the correct answer Thesolution Give an extra weight to passages that are at least from the sameyear of the collected corpus from Google (in this case 2010)

ndash Numeric answers are easily failable due to the fact that when we expecta number it can pick up words such as lsquoonersquo actually retrieving that as

50

a top answer in a couple of cases Moreover when it expects an answerof type NumericDate it could still easely choose the dateyear whenthe article was posted (or even the number of views of that article) Thesolution might pass for a) ignoring lsquoonersquo to lsquothreersquo when written in wordsb) exclude the publishing date from the snippet somehow and c) improvethe regular expressions used to filter candidates

ndash Filtering stage does not seem to work in certain questions ndash in some ques-tions we found that one or more of the final answers are actually withinthe question under certain variations For instance for the question lsquoWhois John J Famalaro accused of having killedrsquo John Famalaro is the topanswer retrieved The solution here should be applying a set of rules toremove these variations from the final answers

ndash Answers in different formats than expected ndash some questions imply mul-tiple answers but we are only searching for a single one Here is a goodexample the correct answer for the question lsquoWhen did Innsbruck hostthe Winter Olympicsrsquo is actually two dates as it is stated in the refer-ence file but the system only goes to retrieve one of them leading to aincorrect answer Should we put the multiple answers separated as singlereferences Other questions such as lsquoWhen did the journalist CharlotteBass liversquo demand a time interval and yet a single year is retrieved Whatto do in this case

ndash Difficulty tracing certain types of answer (category confusion) ndash for thequestion lsquoWhat position does Lothar Matthaus playrsquo the answer lsquoGer-manyrsquo appears on top In another example of not finding the right typeof answer the question lsquoWhat was the first film directed by John Miliusrsquois supposed to answer lsquoDillingerrsquo but chooses lsquowesternrsquo and lsquodocumentaryrsquoas answers instead We know that the question classifier is not 100 foulproof but maybe we should try to make a list of acceptable terms forcertain headwords And in the case of films and other works perhapstrace only those between ldquordquo if possible

ndash Incomplete references ndash the most easily detectable error to be correctedis the reference file that presents us all the possible lsquocorrectrsquo answerspotentially missing viable references

After the insertion of gaps such as this while also removing a contra-ditory reference or two the results improved greatly and the system nowmanages to answer correctly 96 of the 200 questions for an accuracy of480 and a precision of 552

51

6 Conclusions and Future Work

Over the course of this thesis we have experimented new ways of improvingthe Answers Module of JustAsk our own QA system

Here are the main contributions achieved

ndash We performed a state of art on relations between answersrsquo relations andparaphrase identification to detect new possible ways to improve JustAskrsquosresults with the Answer Extraction module

ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion

ndash We developed new metrics to improve the clustering of answers

ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats

ndash We developed new corpora that we believe will be extremely useful in thefuture

ndash We performed further analysis to detect persistent error patterns

In the end results were not as good as we could hope for but we still managedto improve the systemrsquos precision as high as 91 using a modified version ofthe Multisix corpus composed of 200 questions

There is still many work to be done in the future We understand thatthe problem here does not lie exclusively with the Answer module of JustAskndash what happens before has a severe impact over this module namely duringquery formulation and passage retrieval Over the many tasks we should tryhere are a selected few

ndash Develop a more semantic clusterer instead of merely lsquoortographicalrsquo byadding more rules that deal with the semantic of the answers

ndash Use the rules as a distance metric of its own

ndash Finally implementing solutions to the problems and questions unansweredover Section 54 can help improving somewhat these results

52

References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006

[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989

53

[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006

[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini Simone Romagnoli Alessandro Vallin Jesus HerreraAnselmo Penas Vıctor Peinado Felisa Verdejo and Maarten de Rijke Themultiple language question answering track at clef 2003 In CLEF volume3237 of Lecture Notes in Computer Science pages 471ndash486 Springer 2003

[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor ErbachAnselmo Pe nas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov andRichard F E Sutcliffe Overview of the clef 2004 multilingual questionanswering track In CLEF volume 3491 of Lecture Notes in ComputerScience pages 371ndash391 Springer 2004

54

[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence

[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006

[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995

[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006

[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970

[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal

[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009

[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools

55

[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005

56

Appendix

Appendix A ndash Rules for detecting relations between rules

57

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2 and considering

bull Aij ndash the jth token of candidate answer Ai

bull Aijk ndash the kth char of the jth token of candidate answer Ai

bull lemma(X) ndash the lemma of X

bull ishypernym(XY ) ndash returns true if X is the hypernym of Y

bull length(X) ndash the number of tokens of X

bull contains(X Y ) ndash returns true if Ai contains Aj

bull containsic(X Y ) ndash returns true if X contains Y case insensitive

bull equals(X Y ) ndash returns true if X is lexicographically equal to Y

bull equalsic(X Y ) ndash returns true if X is lexicographically equal to Y caseinsensitive

bull matches(XRegExp) ndash returns true if X matches the regular expressionRegExp

11 Questions of category Date

Ai includes Aj if

Then the LogSim metric by Cordeiro and his team, described in the Related Work in Section 3.2.2, was implemented and added to the package, using the same weighting penalty of 2.7 as in the team's experiments. We then tested all the similarity measures using three different threshold values for the single-link clustering algorithm, namely 1/2 (0.5), 1/3 (0.33) and 1/6 (0.17), in order to evaluate how much the threshold affects the performance of each metric.
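For illustration purposes only, here is a minimal sketch of single-link clustering of candidate answers under a distance threshold; the class and method names are ours, and this is not the actual code of JustAsk or of the similarity package.

    import java.util.*;
    import java.util.function.BiFunction;

    // Minimal sketch (not the actual JustAsk code): single-link clustering of
    // candidate answers. Two clusters are merged while the smallest pairwise
    // distance between their members is below the threshold (e.g. 0.17, 0.33, 0.5).
    public class SingleLinkClusterer {

        public static List<List<String>> cluster(List<String> answers,
                                                 BiFunction<String, String, Double> distance,
                                                 double threshold) {
            // Start with one singleton cluster per candidate answer.
            List<List<String>> clusters = new ArrayList<>();
            for (String a : answers) {
                clusters.add(new ArrayList<>(Collections.singletonList(a)));
            }
            boolean merged = true;
            while (merged) {
                merged = false;
                outer:
                for (int i = 0; i < clusters.size(); i++) {
                    for (int j = i + 1; j < clusters.size(); j++) {
                        if (singleLink(clusters.get(i), clusters.get(j), distance) < threshold) {
                            clusters.get(i).addAll(clusters.remove(j));
                            merged = true;
                            break outer;   // restart the scan after every merge
                        }
                    }
                }
            }
            return clusters;
        }

        // Single-link: the distance between clusters is the minimum pairwise distance.
        private static double singleLink(List<String> c1, List<String> c2,
                                         BiFunction<String, String, Double> distance) {
            double min = Double.MAX_VALUE;
            for (String a : c1)
                for (String b : c2)
                    min = Math.min(min, distance.apply(a, b));
            return min;
        }

        public static void main(String[] args) {
            // Toy distance used only to keep the sketch self-contained; a normalized
            // Levenshtein (or any other metric from Table 8) would go here instead.
            BiFunction<String, String, Double> dist =
                    (a, b) -> a.equalsIgnoreCase(b) ? 0.0 : 1.0;
            System.out.println(cluster(Arrays.asList("Kennedy", "kennedy", "Nixon"), dist, 0.33));
        }
    }

Lowering the threshold makes merging stricter (fewer, purer clusters), which is exactly the effect being measured in the experiments below.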

5.1.2 Results

Results are described in Table 8. We decided to evaluate these metrics using the corpus of 200 questions for a simple reason: it is by far the corpus that provides the best results for the system at this point, and we should expect proportional results between the two corpora either way.

5.1.3 Results Analysis

As we can see, the metric that achieves the best results is still the Levenshtein distance, with the LogSim distance a close second overall, even holding the distinction of being the best measure for the highest threshold evaluated (0.5). Defining the 'best' threshold was not as clear as picking the best metric given these results. The threshold of 0.17 provides the best results for more than half of the metrics here, but there are cases where it is only the second best, even if the difference is virtually nonexistent in most cases.


Metric / Threshold     0.17                 0.33                 0.5
Levenshtein            P51.1 R87.0 A44.5    P43.7 R87.5 A38.5    P37.3 R84.5 A31.5
Overlap                P37.9 R87.0 A33.0    P33.0 R87.0 A33.0    P36.4 R86.5 A31.5
Jaccard index          P9.8  R56.0 A5.5     P9.9  R55.5 A5.5     P9.9  R55.5 A5.5
Dice Coefficient       P9.8  R56.0 A5.5     P9.9  R56.0 A5.5     P9.9  R55.5 A5.5
Jaro                   P11.0 R63.5 A7.0     P11.9 R63.0 A7.5     P9.8  R56.0 A5.5
Jaro-Winkler           P11.0 R63.5 A7.0     P11.9 R63.0 A7.5     P9.8  R56.0 A5.5
Needleman-Wunch        P43.7 R87.0 A38.0    P43.7 R87.0 A38.0    P16.9 R59.0 A10.0
LogSim                 P43.7 R87.0 A38.0    P43.7 R87.0 A38.0    P42.0 R87.0 A36.5
Soundex                P35.9 R83.5 A30.0    P35.9 R83.5 A30.0    P28.7 R82.0 A23.5
Smith-Waterman         P17.3 R63.5 A11.0    P13.4 R59.5 A8.0     P10.4 R57.5 A6.0

Table 8: Precision (P), Recall (R) and Accuracy (A) results for every similarity metric added, for a given threshold value, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold.

5.2 New Distance Metric - 'JFKDistance'

5.2.1 Description

Considering the two candidate answers 'J. F. Kennedy' and 'John Fitzgerald Kennedy', Levenshtein will give us a very high distance and therefore will not group these two answers into a single cluster. The goal here is to finally be able to relate answers such as these. A new measure was therefore created in order to improve the overall accuracy of JustAsk. This measure is, in a way, a progression from metrics such as Overlap Distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, and then adds an extra weight for strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer. To exemplify, consider the example above. We have a perfect match in the common term Kennedy, and then we have two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers, and Levenshtein will be closer to 0.5, still underwhelming. Since we already have one word in common, with the new metric we can assume with some confidence that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

    distance = 1 - commonTermsScore - firstLetterMatchesScore                (20)

in which the scores for the common terms and for the first-letter matches of each token are given by

    commonTermsScore = commonTerms / max(length(a1), length(a2))             (21)

and

    firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2      (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
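For illustration, the following is a minimal sketch of a distance computed along the lines of equations (20) to (22); it reflects our reading of the description above rather than the actual JustAsk implementation, and the tokenization details are assumptions.

    import java.util.*;

    // Minimal sketch of a distance following equations (20)-(22); a reading of the
    // description above, not the actual JustAsk implementation.
    public class JFKDistanceSketch {

        public static double distance(String a1, String a2) {
            List<String> t1 = new ArrayList<>(Arrays.asList(a1.trim().split("\\s+")));
            List<String> t2 = new ArrayList<>(Arrays.asList(a2.trim().split("\\s+")));
            int maxLen = Math.max(t1.size(), t2.size());

            // Count perfect (case-insensitive) token matches and remove them,
            // leaving only the non-common terms on both sides.
            int commonTerms = 0;
            for (Iterator<String> it = t1.iterator(); it.hasNext(); ) {
                String tok = it.next();
                int match = indexOfIgnoreCase(t2, tok);
                if (match >= 0) {
                    commonTerms++;
                    it.remove();
                    t2.remove(match);
                }
            }

            // Count pairs of non-common terms that share their first letter; each pair
            // is given half the weight of a full token match, as described above.
            int nonCommonTerms = t1.size() + t2.size();
            int firstLetterMatches = 0;
            for (String tok : t1) {
                for (Iterator<String> it = t2.iterator(); it.hasNext(); ) {
                    String other = it.next();
                    if (!tok.isEmpty() && !other.isEmpty()
                            && Character.toLowerCase(tok.charAt(0)) == Character.toLowerCase(other.charAt(0))) {
                        firstLetterMatches++;
                        it.remove();   // each token participates in at most one pair
                        break;
                    }
                }
            }

            double commonTermsScore = (double) commonTerms / maxLen;               // eq. (21)
            double firstLetterMatchesScore = nonCommonTerms == 0
                    ? 0.0
                    : ((double) firstLetterMatches / nonCommonTerms) / 2.0;        // eq. (22)
            return 1.0 - commonTermsScore - firstLetterMatchesScore;               // eq. (20)
        }

        private static int indexOfIgnoreCase(List<String> tokens, String tok) {
            for (int i = 0; i < tokens.size(); i++) {
                if (tokens.get(i).equalsIgnoreCase(tok)) return i;
            }
            return -1;
        }

        public static void main(String[] args) {
            // 'J. F. Kennedy' vs 'John Fitzgerald Kennedy': one common term plus two
            // first-letter pairs yield a distance well below plain Levenshtein.
            System.out.println(distance("J. F. Kennedy", "John Fitzgerald Kennedy"));
        }
    }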

A few flaws of this measure were identified later on, namely regarding its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results can become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                      200 questions          1210 questions
Before                P51.1 R87.0 A44.5      P28.2 R79.8 A22.5
After new metric      P51.7 R87.0 A45.0      P29.2 R79.8 A23.3

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features - Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or a group of categories). If, for instance, we detected that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect much the analysis we are looking for), here are the results for every metric incorporated, for the three main categories, Human, Location and Numeric, that can be easily evaluated (categories such as Description, answered by retrieving the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric / Category     Human    Location    Numeric
Levenshtein           50.00    53.85        4.84
Overlap               42.42    41.03        4.84
Jaccard index          7.58     0.00        0.00
Dice                   7.58     0.00        0.00
Jaro                   9.09     0.00        0.00
Jaro-Winkler           9.09     0.00        0.00
Needleman-Wunch       42.42    46.15        4.84
LogSim                42.42    46.15        4.84
Soundex               42.42    46.15        3.23
Smith-Waterman         9.09     2.56        0.00
JFKDistance           54.55    53.85        4.84

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best accuracy of the system in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways through each category, i.e., results in each category are always 'proportional' to how they perform overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

- Time-changing answers - Some questions do not have a straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the collected corpus from Google (in this case, 2010).

- Numeric answers are easily failable, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for a) ignoring 'one' to 'three' when written in words, b) excluding the publishing date from the snippet somehow, and c) improving the regular expressions used to filter candidates (a small sketch of such a filter follows this list).

- Filtering stage does not seem to work in certain questions - in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers.

- Answers in different formats than expected - some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

- Difficulty tracing certain types of answer (category confusion) - for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

- Incomplete references - the most easily detectable error to be corrected is the reference file that presents us all the possible 'correct' answers, which is potentially missing viable references.

After filling gaps such as these, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
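As an illustration of point c) of the Numeric item above, a candidate filter along these lines could require an actual digit and skip small spelled-out numbers; the patterns and names below are assumptions of ours, not the system's current filter.

    import java.util.regex.Pattern;

    // Sketch only (assumed patterns, not JustAsk's actual filter): accept numeric
    // candidates that contain digits and reject the small spelled-out numbers that
    // the extractor tends to pick up ('one', 'two', 'three').
    public class NumericCandidateFilter {
        private static final Pattern HAS_DIGIT = Pattern.compile(".*\\d+.*");
        private static final Pattern SPELLED_OUT_SMALL =
                Pattern.compile("(?i)\\b(one|two|three)\\b");

        public static boolean keep(String candidate) {
            return HAS_DIGIT.matcher(candidate).matches()
                    && !SPELLED_OUT_SMALL.matcher(candidate).find();
        }

        public static void main(String[] args) {
            System.out.println(keep("200 people")); // true
            System.out.println(keep("one"));        // false
        }
    }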


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved

- We performed a state-of-the-art review on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module.

- We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

- We developed new metrics to improve the clustering of answers.

- We developed new answer normalizers to relate two equal answers in different formats.

- We developed new corpora that we believe will be extremely useful in the future.

- We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk: what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

- Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

- Use the rules as a distance metric of their own (a small sketch of this idea follows this list).

- Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
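A minimal sketch of the second point above, assuming a hypothetical RelationRules interface that wraps the rules of Appendix A (this is not an existing JustAsk class):

    // Minimal sketch of the idea above: wrapping the equivalence rules of Appendix A
    // as a distance metric of their own. 'RelationRules' is a hypothetical interface
    // standing in for the rule engine; it is not an existing JustAsk class.
    public class RuleDistance {

        public interface RelationRules {
            boolean areEquivalent(String a1, String a2);
        }

        // Rule-equivalent answers get distance 0 and are therefore always clustered
        // together, regardless of how far apart they are orthographically.
        public static double distance(String a1, String a2, RelationRules rules) {
            return rules.areEquivalent(a1, a2) ? 0.0 : 1.0;
        }
    }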


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805-810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25-32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16-23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297-302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUK 2008, pages 18-26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511-525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques regions voisines. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37:241-272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data files. In Statistics in Medicine, volume 14, pages 491-498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414-420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19-33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 - short papers - Volume 2, NAACL-Short '03, pages 46-48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143-150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211-240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265-283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707-710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845-848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1-7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112-120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesus Herrera, Anselmo Penas, Victor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471-486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Penas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371-391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775-780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39-41, 1995.

[29] Veronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311-318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18-26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448-453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Tecnica de Lisboa - Instituto Superior Tecnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513-523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Tecnica de Lisboa - Instituto Superior Tecnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just.Ask - A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491-502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1-13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation - LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354-359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133-138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160-166, Sydney, Australia, 2005.


Appendix

Appendix A - Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j - the j-th token of candidate answer Ai

• Ai,j,k - the k-th char of the j-th token of candidate answer Ai

• lemma(X) - the lemma of X

• ishypernym(X, Y) - returns true if X is the hypernym of Y

• length(X) - the number of tokens of X

• contains(X, Y) - returns true if X contains Y

• containsic(X, Y) - returns true if X contains Y, case insensitive

• equals(X, Y) - returns true if X is lexicographically equal to Y

• equalsic(X, Y) - returns true if X is lexicographically equal to Y, case insensitive

• matches(X, RegExp) - returns true if X matches the regular expression RegExp

1.1 Questions of category Date

Ai includes Aj if

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  - Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  - Example: Y2001 includes D01 M01 Y2001


1.2 Questions of category Numeric

Given

• number = (\d+([0-9,.E])*)

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX - the numeric value in X

• numXmax = numX * max

• numXmin = numX * min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  - Example: 200 people is equivalent to 200

• matchesic(Ai, "(.*)\bnumber( .*)") and matchesic(Aj, "(.*)\bnumber( .*)") and (numAimax > numAj) and (numAimin < numAj)
  - Example: 210 people is equivalent to 200 people

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)")
  - Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)") and (numAi = numAj)
  - Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number( .*)") and (numAjmax > numAi) and (numAjmin < numAi)
  - Example: approximately 200 people includes 210


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number( .*)") and (numAi < numAj)
  - Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number( .*)") and (numAi > numAj)
  - Example: less than 300 includes 200
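To illustrate how rules of this kind can be checked programmatically, the following is a minimal sketch (ours, not part of the original rule implementation) of the 'morethan'/'lessthan' inclusion checks; the regular expressions and names are assumptions based on the definitions above.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch only: checking two of the Numeric inclusion rules above.
    public class NumericRulesSketch {
        static final String NUMBER = "\\d+([0-9,.E])*";
        static final Pattern MORETHAN =
                Pattern.compile("(?i).*\\b(more than|over|at least)\\s+(" + NUMBER + ").*");
        static final Pattern LESSTHAN =
                Pattern.compile("(?i).*\\b(less than|almost|up to|nearly)\\s+(" + NUMBER + ").*");

        // Returns true if Ai ("more than 200") includes Aj ("300"), or if
        // Ai ("less than 300") includes Aj ("200"), following the rules above.
        static boolean includes(String ai, String aj) {
            if (!aj.trim().matches(NUMBER)) return false;      // Aj must be a single number
            double numAj = parse(aj);
            Matcher more = MORETHAN.matcher(ai);
            if (more.matches() && parse(more.group(2)) < numAj) return true;
            Matcher less = LESSTHAN.matcher(ai);
            return less.matches() && parse(less.group(2)) > numAj;
        }

        static double parse(String s) {
            return Double.parseDouble(s.replace(",", ""));     // drop thousands separators
        }

        public static void main(String[] args) {
            System.out.println(includes("more than 200", "300"));   // true
            System.out.println(includes("less than 300", "200"));   // true
            System.out.println(includes("more than 400", "300"));   // false
        }
    }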

1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  - Example: John F. Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  - Example: Mr. Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  - Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  - Example: John F. Kennedy is equivalent to F. Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  - Example: Mr. Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
  - Example:

• levenshteinDistance(Ai, Aj) < 1
  - Example:

1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  - Example: Republic of China is equivalent to China


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  - Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0) and matches(Ai, "\s*in\s+Aj,0")
  - Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  - Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  - Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  - Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  - Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  - Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  - Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  - Example: animal includes monkey

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 47: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

MetricThreshold 017 033 05Levenshtein P511

R870A445

P437R875A385

P373R845A315

Overlap P379R870A330

P330R870A330

P364R865A315

Jaccard index P98R560A55

P99R555A55

P99R555A55

Dice Coefficient P98R560A55

P99R560A55

P99R555A55

Jaro P110R635A70

P119R630A75

P98R560A55

Jaro-Winkler P110R635A70

P119R630A75

P98R560A55

Needleman-Wunch P437R870A380

P437R870A380

P169R590A100

LogSim P437R870A380

P437R870A380

A420R870A365

Soundex P359R835A300

P359R835A300

P287R820A235

Smith-Waterman P173R635A110

P134R595A80

P104R575A60

Table 8 Precision (P) Recall (R) and Accuracy (A) results for every similaritymetric added for a given threshold value using the test corpus of 200 questionsand 32 passages from Google Best precision and accuracy of the system bolded

52 New Distance Metric - rsquoJFKDistancersquo

521 Description

Considering the two candidate answers lsquoJF Kennedyrsquo and lsquoJohn FitzgeraldKennedyrsquo Levenshtein will give us a very high distance and therefore will notgroup these two answers into a single cluster The goal here is to finally beable to relate answers such as these A new measure was therefore created in

47

order to improve the overall accuracy of JustAsk This measure is in a waya progression from metrics such as Overlap Distance and LogSim also using atokenizer to break the strings into a set of tokens and then taking the basicconcept of common terms (we can also call it links words or tokens) betweenthe two strings and also adding an extra weight to strings that not only shareat least a perfect match but have also another token that shares the first letterwith the other token from the other answer To exemplify consider the exampleabove We have a perfect match in the common term Kennedy and then wehave two tokens unmatched Even the apparently intuitive Overlap distancewill give us a value of 23 making it impossible to cluster these two answersLevenshtein will be closer to 05 Still underwhelming Since we already haveone word in common with the new metric we can assume with some securitythat there is a relation between lsquoJohnrsquo and lsquoJrsquo and lsquoFitzgeraldrsquo and lsquoFrsquo Thisweight will obviously not be as big as the score we give to a lsquoperfectrsquo match butit should help us clustering more of these types of answers For that matter wedivide by 2 what would be a value identical to a match between two tokens forthat extra weight

The new distance is computed the following way

distance = 1minus commonTermsScoreminus firstLetterMatchesScore (20)

in which the scores for the common terms and matches of the first letter ofeach token are given by

commonTermsScore =commonTerms

max(length(a1) length(a2)) (21)

and

firstLetterMatchesScore =firstLetterMatchesnonCommonTerms

2(22)

with a1 and a2 being the two answers converted to strings after normaliza-tion and nonCommonTerms being the terms that remain unmatched such aslsquoJrsquo lsquoJohnrsquo lsquoFrsquo andlsquoFitzgeraldrsquo in the above example

A few flaws about this measure were thought later on namely the scopeof this measure whithin our topology of answers if for example we applythis same measure to answers of type Numeric results could become quiteunpredictable and wrong It became quite clear that a separation of measuresgiven a type of answer (Numeric Entity Location etc) was needed if wewanted to use this measure to its full potential as we will see

48

522 Results

Table 9 shows us the results after the implementation of this new distancemetric

200 questions 1210 questionsBefore P511 R870

A445P282 R798A225

After new metric P517 R870A450

P292 R798A233

Table 9 Precision (P) Recall (R) and Accuracy (A) Results after the imple-mentation of the new distance metric Using 30 passages from New York Timesfor the corpus of 1210 questions from TREC

523 Results Analysis

As we can see for these results even with the redundancy of sharing featuresthat we also have in the rules and in certain normalizers this metric can actu-ally improve our results even if the difference is minimal and not statisticallysignificant

53 Possible new features ndash Combination of distance met-rics to enhance overall results

We also thought in analyzing these results by the question category using anevaluation framework developed earlier The motivation behind this evaluationwas to build a decision tree of sorts that would basically choose a metric (or aconjunction of metrics) for a certain category (or a group of categories) If forinstance we detect that the Levenshtein distance gives us the best answers forthe category Human but for the category Location LogSim was the metricwhich offered us the best results we would have something like

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

()

For that effort we used the best thresholds for each metric using the corpusof 200 questions and Google search engine retrieving 32 passages Our questionclassifier achieves an accuracy of 90 [39] and there have been a few answersnot labeled as correct by the framework that are indeed correct With that inmind and considering that these are issues that are common to every singledistance metric used (so they should not affect that much the analysis we are

49

looking for) here are the results for every metric incorporated by the threemain categories Human Location and Numeric that can be easily evaluated(categories such as Description by retrieving the Wikipedia abstract are justmarked as wrongnil given the reference input) shown in Table 10

MetricCategory Human Location NumericLevenshtein 5000 5385 484Overlap 4242 4103 484Jaccard index 758 000 000Dice 758 000 000Jaro 909 000 000Jaro-Winkler 909 000 000Needleman-Wunch 4242 4615 484LogSim 4242 4615 484Soundex 4242 4615 323Smith-Waterman 909 256 000JFKDistance 5455 5385 484

Table 10 Accuracy results for every similarity metric added for each categoryusing the test corpus of 200 questions and 32 passages from Google Bestprecision and accuracy of the system bolded except for when all the metricsgive the same result

We can easily notice that all of these metrics behave in similar ways througheach category ie results are always lsquoproporcionalrsquo in each category to howthey perform overall That stops us from picking best metrics for each categorysince we find out that the new distance implemented equals or surpasses (aswith the case of Human category) all the others

54 Posterior analysis

In another effort to improve the results above a deeper analysis was madethrough the systemrsquos output regarding the corpus test of 200 questions in orderto detect certain lsquoerror patternsrsquo that could be resolved in the future Below arethe main points of improvement

ndash Time-changing answers ndash Some questions do not have a straight answerone that never changes with time A clear example is a question suchas lsquoWho is the prime-minister of Japanrsquo We have observed that formerprime-minister Yukio Hatoyama actually earns a higher score than cur-rent prime-minister Naoto Kan which should be the correct answer Thesolution Give an extra weight to passages that are at least from the sameyear of the collected corpus from Google (in this case 2010)

ndash Numeric answers are easily failable due to the fact that when we expecta number it can pick up words such as lsquoonersquo actually retrieving that as

50

a top answer in a couple of cases Moreover when it expects an answerof type NumericDate it could still easely choose the dateyear whenthe article was posted (or even the number of views of that article) Thesolution might pass for a) ignoring lsquoonersquo to lsquothreersquo when written in wordsb) exclude the publishing date from the snippet somehow and c) improvethe regular expressions used to filter candidates

ndash Filtering stage does not seem to work in certain questions ndash in some ques-tions we found that one or more of the final answers are actually withinthe question under certain variations For instance for the question lsquoWhois John J Famalaro accused of having killedrsquo John Famalaro is the topanswer retrieved The solution here should be applying a set of rules toremove these variations from the final answers

ndash Answers in different formats than expected ndash some questions imply mul-tiple answers but we are only searching for a single one Here is a goodexample the correct answer for the question lsquoWhen did Innsbruck hostthe Winter Olympicsrsquo is actually two dates as it is stated in the refer-ence file but the system only goes to retrieve one of them leading to aincorrect answer Should we put the multiple answers separated as singlereferences Other questions such as lsquoWhen did the journalist CharlotteBass liversquo demand a time interval and yet a single year is retrieved Whatto do in this case

ndash Difficulty tracing certain types of answer (category confusion) ndash for thequestion lsquoWhat position does Lothar Matthaus playrsquo the answer lsquoGer-manyrsquo appears on top In another example of not finding the right typeof answer the question lsquoWhat was the first film directed by John Miliusrsquois supposed to answer lsquoDillingerrsquo but chooses lsquowesternrsquo and lsquodocumentaryrsquoas answers instead We know that the question classifier is not 100 foulproof but maybe we should try to make a list of acceptable terms forcertain headwords And in the case of films and other works perhapstrace only those between ldquordquo if possible

ndash Incomplete references ndash the most easily detectable error to be correctedis the reference file that presents us all the possible lsquocorrectrsquo answerspotentially missing viable references

After the insertion of gaps such as this while also removing a contra-ditory reference or two the results improved greatly and the system nowmanages to answer correctly 96 of the 200 questions for an accuracy of480 and a precision of 552

51

6 Conclusions and Future Work

Over the course of this thesis we have experimented new ways of improvingthe Answers Module of JustAsk our own QA system

Here are the main contributions achieved

ndash We performed a state of art on relations between answersrsquo relations andparaphrase identification to detect new possible ways to improve JustAskrsquosresults with the Answer Extraction module

ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion

ndash We developed new metrics to improve the clustering of answers

ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats

ndash We developed new corpora that we believe will be extremely useful in thefuture

ndash We performed further analysis to detect persistent error patterns

In the end results were not as good as we could hope for but we still managedto improve the systemrsquos precision as high as 91 using a modified version ofthe Multisix corpus composed of 200 questions

There is still many work to be done in the future We understand thatthe problem here does not lie exclusively with the Answer module of JustAskndash what happens before has a severe impact over this module namely duringquery formulation and passage retrieval Over the many tasks we should tryhere are a selected few

ndash Develop a more semantic clusterer instead of merely lsquoortographicalrsquo byadding more rules that deal with the semantic of the answers

ndash Use the rules as a distance metric of its own

ndash Finally implementing solutions to the problems and questions unansweredover Section 54 can help improving somewhat these results

52

References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006

[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989

53

[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006

[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini Simone Romagnoli Alessandro Vallin Jesus HerreraAnselmo Penas Vıctor Peinado Felisa Verdejo and Maarten de Rijke Themultiple language question answering track at clef 2003 In CLEF volume3237 of Lecture Notes in Computer Science pages 471ndash486 Springer 2003

[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor ErbachAnselmo Pe nas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov andRichard F E Sutcliffe Overview of the clef 2004 multilingual questionanswering track In CLEF volume 3491 of Lecture Notes in ComputerScience pages 371ndash391 Springer 2004

54

[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence

[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006

[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995

[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006

[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970

[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal

[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009

[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools

55

[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005

56

Appendix

Appendix A ndash Rules for detecting relations between rules

57

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2 and considering

bull Aij ndash the jth token of candidate answer Ai

bull Aijk ndash the kth char of the jth token of candidate answer Ai

bull lemma(X) ndash the lemma of X

bull ishypernym(XY ) ndash returns true if X is the hypernym of Y

bull length(X) ndash the number of tokens of X

bull contains(X Y ) ndash returns true if Ai contains Aj

bull containsic(X Y ) ndash returns true if X contains Y case insensitive

bull equals(X Y ) ndash returns true if X is lexicographically equal to Y

bull equalsic(X Y ) ndash returns true if X is lexicographically equal to Y caseinsensitive

bull matches(XRegExp) ndash returns true if X matches the regular expressionRegExp

11 Questions of category Date

Ai includes Aj if

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 1)ndash Example M01 Y2001 includes D01 M01 Y2001

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 2)ndash Example Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

12 Questions of category Numeric

Given

bull number = (d+([0-9E]))

bull lessthan = ((less than)|(almost)|(up to)|(nearly))

bull morethan = ((more than)|(over)|(at least))

bull approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

bull max = 11 (or another value)

bull min = 09 (or another value)

bull numX ndash the numeric value in X

bull numXmax = numX max

bull numXmin = numX min

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj number) andmatches(Ai lsquolsquoAj( [a-z]+)rsquorsquo)ndash Example 200 people is equivalent to 200

bull matchesic(Ai lsquolsquo()bnumber( )rsquorsquo) andmatchesic(Aj lsquolsquo()bnumber( )rsquorsquo) and(numAimax gt numAj) and (numAimin lt numAj)ndash Example 210 people is equivalent to 200 people

Ai includes Aj if

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo)ndash Example approximately 200 people includes 210

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAi = numAj)ndash Example approximately 200 people includes 200

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAjmax gt numAi) and (numAjmin lt numAi)ndash Example approximately 200 people includes 210

1 Rules for detecting relations between answers 3

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bmorethan number( )rsquorsquo) and(numAi lt numAj)ndash Example more than 200 includes 300

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()blessthan number( )rsquorsquo) and(numAi gt numAj)ndash Example less than 300 includes 200

13 Questions of category Human

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai0Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to John Kennedy

bull (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai1Aj1) andmatches(Ai0lsquolsquo(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)rsquorsquo)ndash Example Mr Kennedy is equivalent to John Kennedy

bull (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai0Aj0) and equalsic(Ai2Aj2)and equalsic(Ai10Aj10)ndash Example John Fitzgerald Kennedy is equivalent to John F Kennedy

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai1Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to F Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr. Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0) and matches(Ai, "in\s+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]*")
  – Example: California includes East California
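
A minimal sketch of two of the Location rules above ('Republic of China' vs. 'China', and 'Florida' vs. 'Miami, Florida'); the patterns are simplified and purely illustrative:

import re

DESCRIPTOR = re.compile(r"^(?:.*\b(?:republic|city|kingdom|island)\s+)?(?:of|in)\s+(.+)$", re.I)
DIRECTION = re.compile(r"^(south|west|north|east|central)", re.I)

def location_equivalent(ai: str, aj: str) -> bool:
    # 'Republic of China', 'in China' and similar forms reduce to the bare name
    m = DESCRIPTOR.match(ai.strip())
    return bool(m) and m.group(1).strip().lower() == aj.strip().lower()

def location_includes(ai: str, aj: str) -> bool:
    # 'Florida' includes 'Miami, Florida'; 'California' includes 'East California'
    tokens = [t.strip(",") for t in aj.split()]
    if len(tokens) < 2 or tokens[-1].lower() != ai.strip().lower():
        return False
    return "," in aj or DIRECTION.match(tokens[0]) is not None

print(location_equivalent("Republic of China", "China"))  # True
print(location_includes("Florida", "Miami, Florida"))      # True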

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey
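
The ishypernym predicate can be approximated with WordNet. A minimal sketch using NLTK follows; the thesis does not prescribe this particular toolkit, so treat it as an illustration that assumes the WordNet data is installed:

from nltk.corpus import wordnet as wn

def ishypernym(x: str, y: str) -> bool:
    # true if some sense of x appears among the hypernym ancestors of some sense of y
    senses_x = set(wn.synsets(x))
    for sense_y in wn.synsets(y):
        if senses_x & set(sense_y.closure(lambda s: s.hypernyms())):
            return True
    return False

print(ishypernym("animal", "monkey"))  # True: animal includes monkey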

This new distance metric was implemented in order to improve the overall accuracy of JustAsk. The measure is, in a way, a progression from metrics such as the Overlap Distance and LogSim: it also uses a tokenizer to break the strings into sets of tokens, and then takes the basic concept of common terms (we can also call them links, words or tokens) between the two strings, while adding an extra weight for strings that not only share at least one perfect match but also have another token that shares its first letter with a token from the other answer.

To exemplify, consider two candidate answers such as 'John Fitzgerald Kennedy' and 'J. F. Kennedy'. We have a perfect match in the common term Kennedy, and then two unmatched tokens on each side. Even the apparently intuitive Overlap distance will give us a value of 2/3, making it impossible to cluster these two answers; Levenshtein will be closer to 0.5, still underwhelming. Since we already have one word in common, with the new metric we can assume with some security that there is a relation between 'John' and 'J.' and between 'Fitzgerald' and 'F.'. This weight will obviously not be as big as the score we give to a 'perfect' match, but it should help us cluster more of these types of answers. For that matter, we divide by 2 what would otherwise be a value identical to a match between two tokens, for that extra weight.

The new distance is computed in the following way:

  distance = 1 - commonTermsScore - firstLetterMatchesScore        (20)

in which the scores for the common terms and for the matches of the first letter of each token are given by:

  commonTermsScore = commonTerms / max(length(a1), length(a2))     (21)

and

  firstLetterMatchesScore = (firstLetterMatches / nonCommonTerms) / 2        (22)

with a1 and a2 being the two answers converted to strings after normalization, and nonCommonTerms being the terms that remain unmatched, such as 'J.', 'John', 'F.' and 'Fitzgerald' in the above example.
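
A minimal sketch of this computation, assuming whitespace tokenization and case-insensitive comparison; names are illustrative and this is not the exact JustAsk implementation:

def jfk_distance(a1: str, a2: str) -> float:
    t1, t2 = a1.lower().split(), a2.lower().split()
    common = set(t1) & set(t2)
    common_terms_score = len(common) / max(len(t1), len(t2))
    # tokens left unmatched on both sides ('john', 'fitzgerald', 'j.', 'f.' in the example)
    rest1 = [t for t in t1 if t not in common]
    rest2 = [t for t in t2 if t not in common]
    non_common = len(rest1) + len(rest2)
    # greedily pair unmatched tokens that share the same first letter
    first_letter_matches, used = 0, [False] * len(rest2)
    for t in rest1:
        for i, u in enumerate(rest2):
            if not used[i] and t[0] == u[0]:
                used[i], first_letter_matches = True, first_letter_matches + 1
                break
    first_letter_score = (first_letter_matches / non_common) / 2 if non_common else 0.0
    return 1.0 - common_terms_score - first_letter_score

print(jfk_distance("John Fitzgerald Kennedy", "J. F. Kennedy"))  # about 0.42, low enough to cluster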

A few flaws of this measure were identified later on, namely its scope within our typology of answers: if, for example, we apply this same measure to answers of type Numeric, results could become quite unpredictable and wrong. It became quite clear that a separation of measures by type of answer (Numeric, Entity, Location, etc.) was needed if we wanted to use this measure to its full potential, as we will see.


5.2.2 Results

Table 9 shows the results after the implementation of this new distance metric.

                      200 questions                    1210 questions
Before                P 51.1%  R 87.0%  A 44.5%        P 28.2%  R 79.8%  A 22.5%
After new metric      P 51.7%  R 87.0%  A 45.0%        P 29.2%  R 79.8%  A 23.3%

Table 9: Precision (P), Recall (R) and Accuracy (A) results after the implementation of the new distance metric, using 30 passages from the New York Times for the corpus of 1210 questions from TREC.

5.2.3 Results Analysis

As we can see from these results, even with the redundancy of sharing features that we also have in the rules and in certain normalizers, this metric can actually improve our results, even if the difference is minimal and not statistically significant.

5.3 Possible new features – Combination of distance metrics to enhance overall results

We also thought of analyzing these results by question category, using an evaluation framework developed earlier. The motivation behind this evaluation was to build a decision tree of sorts that would basically choose a metric (or a conjunction of metrics) for a certain category (or group of categories). If, for instance, we detect that the Levenshtein distance gives us the best answers for the category Human, but for the category Location LogSim was the metric which offered us the best results, we would have something like:

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

(...)
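
A minimal sketch of what such a per-category dispatch could look like; the two metrics below are simple stand-ins, and the real system would plug in its own implementations:

def levenshtein(a: str, b: str) -> float:
    # classic dynamic-programming edit distance, normalized to [0, 1]
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return prev[n] / max(m, n, 1)

def logsim(a: str, b: str) -> float:
    # placeholder for the LogSim metric used in the thesis (not reproduced here)
    return levenshtein(a, b)

METRIC_BY_CATEGORY = {"Human": levenshtein, "Location": logsim}

def distance_for(category: str, a1: str, a2: str) -> float:
    # fall back to a default metric for categories without a specific choice
    return METRIC_BY_CATEGORY.get(category, levenshtein)(a1, a2)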

For that effort, we used the best thresholds for each metric, using the corpus of 200 questions and the Google search engine retrieving 32 passages. Our question classifier achieves an accuracy of 90% [39], and there have been a few answers not labeled as correct by the framework that are indeed correct. With that in mind, and considering that these are issues common to every single distance metric used (so they should not affect much the analysis we are looking for), here are the results for every metric incorporated, by the three main categories Human, Location and Numeric, that can be easily evaluated (categories such as Description, whose answers are retrieved from the Wikipedia abstract, are just marked as wrong/nil given the reference input), shown in Table 10.

Metric / Category       Human     Location   Numeric
Levenshtein             50.00%    53.85%     4.84%
Overlap                 42.42%    41.03%     4.84%
Jaccard index            7.58%     0.00%     0.00%
Dice                     7.58%     0.00%     0.00%
Jaro                     9.09%     0.00%     0.00%
Jaro-Winkler             9.09%     0.00%     0.00%
Needleman-Wunsch        42.42%    46.15%     4.84%
LogSim                  42.42%    46.15%     4.84%
Soundex                 42.42%    46.15%     3.23%
Smith-Waterman           9.09%     2.56%     0.00%
JFKDistance             54.55%    53.85%     4.84%

Table 10: Accuracy results for every similarity metric added, for each category, using the test corpus of 200 questions and 32 passages from Google. Best precision and accuracy of the system in bold, except when all the metrics give the same result.

We can easily notice that all of these metrics behave in similar ways across categories, i.e., results in each category are always 'proportional' to how the metric performs overall. That stops us from picking the best metric for each category, since we find that the new distance implemented equals or surpasses (as in the case of the Human category) all the others.

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement:

– Time-changing answers – Some questions do not have a single, stable answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).

– Numeric answers are easily failable, due to the fact that when we expect a number the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it could still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass through a) ignoring 'one' to 'three' when written in words, b) excluding the publishing date from the snippet somehow, and c) improving the regular expressions used to filter candidates (see the sketch after this list).

– Filtering stage does not seem to work in certain questions – in some questions we found that one or more of the final answers are actually within the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers (a sketch follows this list).

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we put the multiple answers separated as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to make a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps trace only those between quotation marks, if possible.

– Incomplete references – the most easily detectable error to be corrected is the reference file that presents us all the possible 'correct' answers, potentially missing viable references.

After filling gaps such as these in the reference file, while also removing a contradictory reference or two, the results improved greatly, and the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
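
As an illustration of the kind of fixes suggested above for the filtering stage and for Numeric answers, here is a minimal sketch that drops candidates which are mere variations of the question and ignores small spelled-out numbers; the thresholds and word lists are assumptions, not part of JustAsk:

def is_question_variation(candidate: str, question: str) -> bool:
    # drop candidates whose tokens all appear in the question
    # (e.g. 'John Famalaro' for 'Who is John J. Famalaro accused of having killed?')
    q_tokens = {t.strip("?.,").lower() for t in question.split()}
    c_tokens = [t.strip("?.,").lower() for t in candidate.split()]
    return all(t in q_tokens for t in c_tokens)

SMALL_NUMBERS = {"one", "two", "three"}

def keep_numeric_candidate(candidate: str) -> bool:
    # ignore 'one' to 'three' when written out in words
    return candidate.strip().lower() not in SMALL_NUMBERS

candidates = ["John Famalaro", "one", "a neighbor"]
question = "Who is John J. Famalaro accused of having killed?"
filtered = [c for c in candidates
            if not is_question_variation(c, question) and keep_numeric_candidate(c)]
print(filtered)  # ['a neighbor']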


6 Conclusions and Future Work

Over the course of this thesis, we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state-of-the-art survey on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could have hoped for, but we still managed to improve the system's precision by as much as 9.1%, using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own.

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.


References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.
[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.
[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.
[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.
[5] Joao Cordeiro, Gael Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.
[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.
[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.
[10] D. Hays. Dependency theory: A formalism and some observations. Language, 40, pages 511–525, 1964.
[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.
[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.
[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.
[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.
[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.
[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. MIT Press, pages 265–283, 1998.
[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163 No4, 845–848, 1965.
[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.
[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.
[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.
[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.
[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.
[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.
[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.
[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.
[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.
[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.
[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.
[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.
[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (IJCAI), pages 448–453, 1995.
[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.
[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.
[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.
[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.
[39] J. Silva, L. Coheur and A. Mendes. JustAsk – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.
[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.
[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.
[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.
[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

Appendix

Appendix A ndash Rules for detecting relations between rules

57

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2 and considering

bull Aij ndash the jth token of candidate answer Ai

bull Aijk ndash the kth char of the jth token of candidate answer Ai

bull lemma(X) ndash the lemma of X

bull ishypernym(XY ) ndash returns true if X is the hypernym of Y

bull length(X) ndash the number of tokens of X

bull contains(X Y ) ndash returns true if Ai contains Aj

bull containsic(X Y ) ndash returns true if X contains Y case insensitive

bull equals(X Y ) ndash returns true if X is lexicographically equal to Y

bull equalsic(X Y ) ndash returns true if X is lexicographically equal to Y caseinsensitive

bull matches(XRegExp) ndash returns true if X matches the regular expressionRegExp

11 Questions of category Date

Ai includes Aj if

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 1)ndash Example M01 Y2001 includes D01 M01 Y2001

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 2)ndash Example Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

12 Questions of category Numeric

Given

bull number = (d+([0-9E]))

bull lessthan = ((less than)|(almost)|(up to)|(nearly))

bull morethan = ((more than)|(over)|(at least))

bull approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

bull max = 11 (or another value)

bull min = 09 (or another value)

bull numX ndash the numeric value in X

bull numXmax = numX max

bull numXmin = numX min

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj number) andmatches(Ai lsquolsquoAj( [a-z]+)rsquorsquo)ndash Example 200 people is equivalent to 200

bull matchesic(Ai lsquolsquo()bnumber( )rsquorsquo) andmatchesic(Aj lsquolsquo()bnumber( )rsquorsquo) and(numAimax gt numAj) and (numAimin lt numAj)ndash Example 210 people is equivalent to 200 people

Ai includes Aj if

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo)ndash Example approximately 200 people includes 210

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAi = numAj)ndash Example approximately 200 people includes 200

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAjmax gt numAi) and (numAjmin lt numAi)ndash Example approximately 200 people includes 210

1 Rules for detecting relations between answers 3

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bmorethan number( )rsquorsquo) and(numAi lt numAj)ndash Example more than 200 includes 300

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()blessthan number( )rsquorsquo) and(numAi gt numAj)ndash Example less than 300 includes 200

13 Questions of category Human

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai0Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to John Kennedy

bull (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai1Aj1) andmatches(Ai0lsquolsquo(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)rsquorsquo)ndash Example Mr Kennedy is equivalent to John Kennedy

bull (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai0Aj0) and equalsic(Ai2Aj2)and equalsic(Ai10Aj10)ndash Example John Fitzgerald Kennedy is equivalent to John F Kennedy

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai1Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to F Kennedy

131 Questions of category HumanIndividual

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai1Aj0)ndash Example Mr Kennedy is equivalent to Kennedy

bull containsic(AiAj) ndash Example

bull levenshteinDistance(AiAj) lt 1 ndash Example

14 Questions of category Location

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai0lsquolsquo(republic|city|kingdom|island)rsquorsquo) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example Republic of China is equivalent to China

1 Rules for detecting relations between answers 4

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ailsquolsquosins+Aj0rsquorsquo)ndash Example in China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) andmatches(Ailsquolsquo(republic|city|kingdom|island)s+(in|of)s+Aj0rsquorsquo)ndash Example in the lovely Republic of China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) and matches(Ailsquolsquo(in|of)s+Aj0rsquorsquo)ndash Example

Ai includes Aj if

bull (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai0Aj2) andequalsic(Aj1ldquordquo) andndash Example Florida includes Miami Florida

bull (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai0Aj3) andequalsic(Aj2ldquordquo) andndash Example California includes Los Angeles California

bull (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai0Aj1) andmatchesic(Aj0lsquolsquo(south|west|north|east|central)[a-z]rsquorsquo) andndash Example California includes East California

15 Questions of category Entity

Ai is equivalent to Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and equalsic(lemma(Ai)lemma(Aj))ndash Example

Ai includes Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and ishypernym(AiAj)ndash Example animal includes monkey

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 49: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

522 Results

Table 9 shows us the results after the implementation of this new distancemetric

200 questions 1210 questionsBefore P511 R870

A445P282 R798A225

After new metric P517 R870A450

P292 R798A233

Table 9 Precision (P) Recall (R) and Accuracy (A) Results after the imple-mentation of the new distance metric Using 30 passages from New York Timesfor the corpus of 1210 questions from TREC

523 Results Analysis

As we can see for these results even with the redundancy of sharing featuresthat we also have in the rules and in certain normalizers this metric can actu-ally improve our results even if the difference is minimal and not statisticallysignificant

53 Possible new features ndash Combination of distance met-rics to enhance overall results

We also thought in analyzing these results by the question category using anevaluation framework developed earlier The motivation behind this evaluationwas to build a decision tree of sorts that would basically choose a metric (or aconjunction of metrics) for a certain category (or a group of categories) If forinstance we detect that the Levenshtein distance gives us the best answers forthe category Human but for the category Location LogSim was the metricwhich offered us the best results we would have something like

if (Category == Human) use Levenshtein

if (Category == Location) use LogSim

()

For that effort we used the best thresholds for each metric using the corpusof 200 questions and Google search engine retrieving 32 passages Our questionclassifier achieves an accuracy of 90 [39] and there have been a few answersnot labeled as correct by the framework that are indeed correct With that inmind and considering that these are issues that are common to every singledistance metric used (so they should not affect that much the analysis we are

49

looking for) here are the results for every metric incorporated by the threemain categories Human Location and Numeric that can be easily evaluated(categories such as Description by retrieving the Wikipedia abstract are justmarked as wrongnil given the reference input) shown in Table 10

MetricCategory Human Location NumericLevenshtein 5000 5385 484Overlap 4242 4103 484Jaccard index 758 000 000Dice 758 000 000Jaro 909 000 000Jaro-Winkler 909 000 000Needleman-Wunch 4242 4615 484LogSim 4242 4615 484Soundex 4242 4615 323Smith-Waterman 909 256 000JFKDistance 5455 5385 484

Table 10 Accuracy results for every similarity metric added for each categoryusing the test corpus of 200 questions and 32 passages from Google Bestprecision and accuracy of the system bolded except for when all the metricsgive the same result

We can easily notice that all of these metrics behave in similar ways througheach category ie results are always lsquoproporcionalrsquo in each category to howthey perform overall That stops us from picking best metrics for each categorysince we find out that the new distance implemented equals or surpasses (aswith the case of Human category) all the others

54 Posterior analysis

In another effort to improve the results above a deeper analysis was madethrough the systemrsquos output regarding the corpus test of 200 questions in orderto detect certain lsquoerror patternsrsquo that could be resolved in the future Below arethe main points of improvement

ndash Time-changing answers ndash Some questions do not have a straight answerone that never changes with time A clear example is a question suchas lsquoWho is the prime-minister of Japanrsquo We have observed that formerprime-minister Yukio Hatoyama actually earns a higher score than cur-rent prime-minister Naoto Kan which should be the correct answer Thesolution Give an extra weight to passages that are at least from the sameyear of the collected corpus from Google (in this case 2010)

ndash Numeric answers are easily failable due to the fact that when we expecta number it can pick up words such as lsquoonersquo actually retrieving that as

50

a top answer in a couple of cases Moreover when it expects an answerof type NumericDate it could still easely choose the dateyear whenthe article was posted (or even the number of views of that article) Thesolution might pass for a) ignoring lsquoonersquo to lsquothreersquo when written in wordsb) exclude the publishing date from the snippet somehow and c) improvethe regular expressions used to filter candidates

ndash Filtering stage does not seem to work in certain questions ndash in some ques-tions we found that one or more of the final answers are actually withinthe question under certain variations For instance for the question lsquoWhois John J Famalaro accused of having killedrsquo John Famalaro is the topanswer retrieved The solution here should be applying a set of rules toremove these variations from the final answers

ndash Answers in different formats than expected ndash some questions imply mul-tiple answers but we are only searching for a single one Here is a goodexample the correct answer for the question lsquoWhen did Innsbruck hostthe Winter Olympicsrsquo is actually two dates as it is stated in the refer-ence file but the system only goes to retrieve one of them leading to aincorrect answer Should we put the multiple answers separated as singlereferences Other questions such as lsquoWhen did the journalist CharlotteBass liversquo demand a time interval and yet a single year is retrieved Whatto do in this case

ndash Difficulty tracing certain types of answer (category confusion) ndash for thequestion lsquoWhat position does Lothar Matthaus playrsquo the answer lsquoGer-manyrsquo appears on top In another example of not finding the right typeof answer the question lsquoWhat was the first film directed by John Miliusrsquois supposed to answer lsquoDillingerrsquo but chooses lsquowesternrsquo and lsquodocumentaryrsquoas answers instead We know that the question classifier is not 100 foulproof but maybe we should try to make a list of acceptable terms forcertain headwords And in the case of films and other works perhapstrace only those between ldquordquo if possible

ndash Incomplete references ndash the most easily detectable error to be correctedis the reference file that presents us all the possible lsquocorrectrsquo answerspotentially missing viable references

After the insertion of gaps such as this while also removing a contra-ditory reference or two the results improved greatly and the system nowmanages to answer correctly 96 of the 200 questions for an accuracy of480 and a precision of 552

51

6 Conclusions and Future Work

Over the course of this thesis we have experimented new ways of improvingthe Answers Module of JustAsk our own QA system

Here are the main contributions achieved

ndash We performed a state of art on relations between answersrsquo relations andparaphrase identification to detect new possible ways to improve JustAskrsquosresults with the Answer Extraction module

ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion

ndash We developed new metrics to improve the clustering of answers

ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats

ndash We developed new corpora that we believe will be extremely useful in thefuture

ndash We performed further analysis to detect persistent error patterns

In the end results were not as good as we could hope for but we still managedto improve the systemrsquos precision as high as 91 using a modified version ofthe Multisix corpus composed of 200 questions

There is still many work to be done in the future We understand thatthe problem here does not lie exclusively with the Answer module of JustAskndash what happens before has a severe impact over this module namely duringquery formulation and passage retrieval Over the many tasks we should tryhere are a selected few

ndash Develop a more semantic clusterer instead of merely lsquoortographicalrsquo byadding more rules that deal with the semantic of the answers

ndash Use the rules as a distance metric of its own

ndash Finally implementing solutions to the problems and questions unansweredover Section 54 can help improving somewhat these results

52

References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006

[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989

53

[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006

[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini Simone Romagnoli Alessandro Vallin Jesus HerreraAnselmo Penas Vıctor Peinado Felisa Verdejo and Maarten de Rijke Themultiple language question answering track at clef 2003 In CLEF volume3237 of Lecture Notes in Computer Science pages 471ndash486 Springer 2003

[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor ErbachAnselmo Pe nas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov andRichard F E Sutcliffe Overview of the clef 2004 multilingual questionanswering track In CLEF volume 3491 of Lecture Notes in ComputerScience pages 371ndash391 Springer 2004

54

[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence

[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006

[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995

[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006

[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970

[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal

[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009

[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools

55

[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005

56

Appendix

Appendix A ndash Rules for detecting relations between rules

57

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2 and considering

bull Aij ndash the jth token of candidate answer Ai

bull Aijk ndash the kth char of the jth token of candidate answer Ai

bull lemma(X) ndash the lemma of X

bull ishypernym(XY ) ndash returns true if X is the hypernym of Y

bull length(X) ndash the number of tokens of X

bull contains(X Y ) ndash returns true if Ai contains Aj

bull containsic(X Y ) ndash returns true if X contains Y case insensitive

bull equals(X Y ) ndash returns true if X is lexicographically equal to Y

bull equalsic(X Y ) ndash returns true if X is lexicographically equal to Y caseinsensitive

bull matches(XRegExp) ndash returns true if X matches the regular expressionRegExp

11 Questions of category Date

Ai includes Aj if

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 1)ndash Example M01 Y2001 includes D01 M01 Y2001

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 2)ndash Example Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

12 Questions of category Numeric

Given

bull number = (d+([0-9E]))

bull lessthan = ((less than)|(almost)|(up to)|(nearly))

bull morethan = ((more than)|(over)|(at least))

bull approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

bull max = 11 (or another value)

bull min = 09 (or another value)

bull numX ndash the numeric value in X

bull numXmax = numX max

bull numXmin = numX min

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj number) andmatches(Ai lsquolsquoAj( [a-z]+)rsquorsquo)ndash Example 200 people is equivalent to 200

bull matchesic(Ai lsquolsquo()bnumber( )rsquorsquo) andmatchesic(Aj lsquolsquo()bnumber( )rsquorsquo) and(numAimax gt numAj) and (numAimin lt numAj)ndash Example 210 people is equivalent to 200 people

Ai includes Aj if

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo)ndash Example approximately 200 people includes 210

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAi = numAj)ndash Example approximately 200 people includes 200

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAjmax gt numAi) and (numAjmin lt numAi)ndash Example approximately 200 people includes 210

1 Rules for detecting relations between answers 3

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bmorethan number( )rsquorsquo) and(numAi lt numAj)ndash Example more than 200 includes 300

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()blessthan number( )rsquorsquo) and(numAi gt numAj)ndash Example less than 300 includes 200

13 Questions of category Human

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai0Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to John Kennedy

bull (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai1Aj1) andmatches(Ai0lsquolsquo(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)rsquorsquo)ndash Example Mr Kennedy is equivalent to John Kennedy

bull (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai0Aj0) and equalsic(Ai2Aj2)and equalsic(Ai10Aj10)ndash Example John Fitzgerald Kennedy is equivalent to John F Kennedy

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai1Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to F Kennedy

131 Questions of category HumanIndividual

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai1Aj0)ndash Example Mr Kennedy is equivalent to Kennedy

bull containsic(AiAj) ndash Example

bull levenshteinDistance(AiAj) lt 1 ndash Example

14 Questions of category Location

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai0lsquolsquo(republic|city|kingdom|island)rsquorsquo) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example Republic of China is equivalent to China

1 Rules for detecting relations between answers 4

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ailsquolsquosins+Aj0rsquorsquo)ndash Example in China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) andmatches(Ailsquolsquo(republic|city|kingdom|island)s+(in|of)s+Aj0rsquorsquo)ndash Example in the lovely Republic of China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) and matches(Ailsquolsquo(in|of)s+Aj0rsquorsquo)ndash Example

Ai includes Aj if

bull (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai0Aj2) andequalsic(Aj1ldquordquo) andndash Example Florida includes Miami Florida

bull (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai0Aj3) andequalsic(Aj2ldquordquo) andndash Example California includes Los Angeles California

bull (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai0Aj1) andmatchesic(Aj0lsquolsquo(south|west|north|east|central)[a-z]rsquorsquo) andndash Example California includes East California

15 Questions of category Entity

Ai is equivalent to Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and equalsic(lemma(Ai)lemma(Aj))ndash Example

Ai includes Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and ishypernym(AiAj)ndash Example animal includes monkey

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 50: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

looking for) here are the results for every metric incorporated by the threemain categories Human Location and Numeric that can be easily evaluated(categories such as Description by retrieving the Wikipedia abstract are justmarked as wrongnil given the reference input) shown in Table 10

MetricCategory Human Location NumericLevenshtein 5000 5385 484Overlap 4242 4103 484Jaccard index 758 000 000Dice 758 000 000Jaro 909 000 000Jaro-Winkler 909 000 000Needleman-Wunch 4242 4615 484LogSim 4242 4615 484Soundex 4242 4615 323Smith-Waterman 909 256 000JFKDistance 5455 5385 484

Table 10 Accuracy results for every similarity metric added for each categoryusing the test corpus of 200 questions and 32 passages from Google Bestprecision and accuracy of the system bolded except for when all the metricsgive the same result

We can easily notice that all of these metrics behave in similar ways througheach category ie results are always lsquoproporcionalrsquo in each category to howthey perform overall That stops us from picking best metrics for each categorysince we find out that the new distance implemented equals or surpasses (aswith the case of Human category) all the others

5.4 Posterior analysis

In another effort to improve the results above, a deeper analysis was made of the system's output for the test corpus of 200 questions, in order to detect certain 'error patterns' that could be resolved in the future. Below are the main points of improvement.

– Time-changing answers – Some questions do not have a straight answer, one that never changes with time. A clear example is a question such as 'Who is the prime-minister of Japan?'. We have observed that former prime-minister Yukio Hatoyama actually earns a higher score than current prime-minister Naoto Kan, which should be the correct answer. The solution: give an extra weight to passages that are at least from the same year as the corpus collected from Google (in this case, 2010).

– Numeric answers are easily failable due to the fact that, when we expect a number, the system can pick up words such as 'one', actually retrieving that as a top answer in a couple of cases. Moreover, when it expects an answer of type Numeric:Date, it can still easily choose the date/year when the article was posted (or even the number of views of that article). The solution might pass for: a) ignoring 'one' to 'three' when written in words; b) excluding the publishing date from the snippet somehow; and c) improving the regular expressions used to filter candidates (see the sketch after this list).

– Filtering stage does not seem to work for certain questions – in some questions we found that one or more of the final answers are actually contained in the question, under certain variations. For instance, for the question 'Who is John J. Famalaro accused of having killed?', John Famalaro is the top answer retrieved. The solution here should be applying a set of rules to remove these variations from the final answers, as in the sketch below.

– Answers in different formats than expected – some questions imply multiple answers, but we are only searching for a single one. Here is a good example: the correct answer for the question 'When did Innsbruck host the Winter Olympics?' is actually two dates, as stated in the reference file, but the system only retrieves one of them, leading to an incorrect answer. Should we store the multiple answers separately, as single references? Other questions, such as 'When did the journalist Charlotte Bass live?', demand a time interval, and yet a single year is retrieved. What to do in this case?

– Difficulty tracing certain types of answer (category confusion) – for the question 'What position does Lothar Matthaus play?', the answer 'Germany' appears on top. In another example of not finding the right type of answer, the question 'What was the first film directed by John Milius?' is supposed to be answered with 'Dillinger', but the system chooses 'western' and 'documentary' as answers instead. We know that the question classifier is not 100% foolproof, but maybe we should try to build a list of acceptable terms for certain headwords. And in the case of films and other works, perhaps only trace candidates between quotation marks, if possible.

– Incomplete references – the most easily correctable error lies in the reference file that lists all the possible 'correct' answers, which is potentially missing viable references.

After filling in gaps such as these in the reference file, while also removing a contradictory reference or two, the results improved greatly: the system now manages to answer correctly 96 of the 200 questions, for an accuracy of 48.0% and a precision of 55.2%.
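Two of these fixes lend themselves to a short illustration. The sketch below (hypothetical helper names and signature, not JustAsk's actual code) drops spelled-out small numbers for Numeric questions and removes candidates that are mere variations of the question itself:

    import re

    SMALL_NUMBER_WORDS = {"one", "two", "three"}  # spelled-out numbers to ignore (point a)

    def filter_candidates(question, category, candidates):
        """Drop noisy candidate answers: spelled-out small numbers for Numeric
        questions, and answers whose tokens all come from the question itself
        (the 'John Famalaro' case of the filtering-stage problem)."""
        question_tokens = set(re.findall(r"\w+", question.lower()))
        kept = []
        for answer in candidates:
            tokens = re.findall(r"\w+", answer.lower())
            if category.startswith("Numeric") and len(tokens) == 1 \
                    and tokens[0] in SMALL_NUMBER_WORDS:
                continue
            if tokens and set(tokens) <= question_tokens:
                continue
            kept.append(answer)
        return kept

    # 'John Famalaro' is filtered out; other candidates survive.
    print(filter_candidates("Who is John J Famalaro accused of having killed",
                            "Human:Individual",
                            ["John Famalaro", "some other candidate"]))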


6 Conclusions and Future Work

Over the course of this thesis we have experimented with new ways of improving the Answers Module of JustAsk, our own QA system.

Here are the main contributions achieved:

– We performed a state-of-the-art review on relations between answers and on paraphrase identification, to detect new possible ways to improve JustAsk's results in the Answer Extraction module.

– We implemented a new relations module that boosts the score of candidate answers sharing relations of equivalence and inclusion.

– We developed new metrics to improve the clustering of answers.

– We developed new answer normalizers to relate two equal answers in different formats.

– We developed new corpora that we believe will be extremely useful in the future.

– We performed further analysis to detect persistent error patterns.

In the end, results were not as good as we could hope for, but we still managed to improve the system's precision by as much as 9.1% using a modified version of the Multisix corpus composed of 200 questions.

There is still much work to be done in the future. We understand that the problem here does not lie exclusively with the Answer module of JustAsk – what happens before has a severe impact on this module, namely during query formulation and passage retrieval. Of the many tasks we could try, here are a selected few:

– Develop a more semantic clusterer, instead of a merely 'orthographical' one, by adding more rules that deal with the semantics of the answers.

– Use the rules as a distance metric of their own (see the sketch below).

– Finally, implementing solutions to the problems and questions left unanswered in Section 5.4 can help improve these results somewhat.
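As an illustration of the second point, the equivalence and inclusion rules of Appendix A could be wrapped directly as a clusterer distance; a minimal hypothetical sketch (the predicates are assumed to implement those rules and are not part of JustAsk's current API):

    def rule_based_distance(a1, a2, equivalent, includes):
        """Binary distance for the clusterer: 0 when the relation rules link the
        two candidate answers, 1 otherwise. `equivalent` and `includes` are
        assumed predicates implementing the rules of Appendix A."""
        if equivalent(a1, a2) or includes(a1, a2) or includes(a2, a1):
            return 0.0
        return 1.0

    # This distance could then be combined with the string metrics of Section 5,
    # for instance by taking the minimum of the two values for each pair.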



Appendix

Appendix A – Rules for detecting relations between answers


Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence and inclusion between two candidate answers.

Assuming candidate answers A1 and A2, and considering:

• Ai,j – the j-th token of candidate answer Ai;

• Ai,j,k – the k-th char of the j-th token of candidate answer Ai;

• lemma(X) – the lemma of X;

• ishypernym(X, Y) – returns true if X is the hypernym of Y;

• length(X) – the number of tokens of X;

• contains(X, Y) – returns true if X contains Y;

• containsic(X, Y) – returns true if X contains Y, case insensitive;

• equals(X, Y) – returns true if X is lexicographically equal to Y;

• equalsic(X, Y) – returns true if X is lexicographically equal to Y, case insensitive;

• matches(X, RegExp) – returns true if X matches the regular expression RegExp.

1.1 Questions of category Date

Ai includes Aj if:

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 1)
  – Example: M01 Y2001 includes D01 M01 Y2001

• contains(Aj, Ai) and ((length(Aj) - length(Ai)) = 2)
  – Example: Y2001 includes D01 M01 Y2001
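A minimal sketch of these two rules, assuming answers are already normalised into the D/M/Y token form used in the examples (illustration only, not the thesis's implementation):

    def date_includes(ai, aj):
        """Ai includes Aj when Aj contains every token of Ai and is one or two
        tokens more specific, e.g. 'M01 Y2001' includes 'D01 M01 Y2001'."""
        ai_tokens, aj_tokens = ai.split(), aj.split()
        token_diff = len(aj_tokens) - len(ai_tokens)
        return token_diff in (1, 2) and all(t in aj_tokens for t in ai_tokens)

    assert date_includes("M01 Y2001", "D01 M01 Y2001")
    assert date_includes("Y2001", "D01 M01 Y2001")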


1.2 Questions of category Numeric

Given:

• number = (\d+([0-9E]))

• lessthan = ((less than)|(almost)|(up to)|(nearly))

• morethan = ((more than)|(over)|(at least))

• approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and:

• max = 1.1 (or another value)

• min = 0.9 (or another value)

• numX – the numeric value in X

• numXmax = numX * max

• numXmin = numX * min

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)")
  – Example: 200 people is equivalent to 200

• matchesic(Ai, "()bnumber( )") and matchesic(Aj, "()bnumber( )") and (numAimax > numAj) and (numAimin < numAj)
  – Example: 210 people is equivalent to 200 people

Ai includes Aj if:

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )")
  – Example: approximately 200 people includes 210

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAi = numAj)
  – Example: approximately 200 people includes 200

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bapprox number( )") and (numAjmax > numAi) and (numAjmin < numAi)
  – Example: approximately 200 people includes 210


• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()bmorethan number( )") and (numAi < numAj)
  – Example: more than 200 includes 300

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "()blessthan number( )") and (numAi > numAj)
  – Example: less than 300 includes 200
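The tolerance-based part of these rules can be sketched as follows (assuming max = 1.1 and min = 0.9 as above; the simplified number pattern and helper names are assumptions, not the thesis's code):

    import re

    NUMBER = re.compile(r"\d+(?:\.\d+)?")  # simplified stand-in for the 'number' pattern
    MAX, MIN = 1.1, 0.9

    def first_number(text):
        match = NUMBER.search(text)
        return float(match.group()) if match else None

    def numeric_equivalent(ai, aj):
        """'210 people' is equivalent to '200 people': Aj falls inside Ai's 10% band."""
        num_ai, num_aj = first_number(ai), first_number(aj)
        if num_ai is None or num_aj is None:
            return False
        return num_ai * MIN < num_aj < num_ai * MAX

    def numeric_includes(ai, aj):
        """'more than 200' includes 300; 'less than 300' includes 200."""
        num_ai, num_aj = first_number(ai), first_number(aj)
        if num_ai is None or num_aj is None:
            return False
        if re.search(r"\b(more than|over|at least)\b", ai, re.I):
            return num_ai < num_aj
        if re.search(r"\b(less than|almost|up to|nearly)\b", ai, re.I):
            return num_ai > num_aj
        return False

    assert numeric_equivalent("210 people", "200 people")
    assert numeric_includes("more than 200", "300")
    assert numeric_includes("less than 300", "200")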

1.3 Questions of category Human

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to John Kennedy

• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai,1, Aj,1) and matches(Ai,0, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)")
  – Example: Mr Kennedy is equivalent to John Kennedy

• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai,0, Aj,0) and equalsic(Ai,2, Aj,2) and equalsic(Ai,1,0, Aj,1,0)
  – Example: John Fitzgerald Kennedy is equivalent to John F Kennedy

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai,1, Aj,0) and equalsic(Ai,2, Aj,1)
  – Example: John F Kennedy is equivalent to F Kennedy

1.3.1 Questions of category Human:Individual

Ai is equivalent to Aj if:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,1, Aj,0)
  – Example: Mr Kennedy is equivalent to Kennedy

• containsic(Ai, Aj)
  – Example:

• levenshteinDistance(Ai, Aj) < 1
  – Example:
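Two of the Human rules can be sketched as follows, slightly adapted so that the title rule compares surnames through the last token (illustration only, helper names are assumptions):

    TITLES = {"Mr", "Mrs", "Dr", "Mister", "Miss", "Doctor", "Lady", "Madame"}

    def human_equivalent(ai, aj):
        """'John F Kennedy' ~ 'John Kennedy' (dropped middle token) and
        'Mr Kennedy' ~ 'John Kennedy' (title plus matching surname)."""
        ti, tj = ai.split(), aj.split()
        if len(ti) == 3 and len(tj) == 2:
            return ti[0].lower() == tj[0].lower() and ti[2].lower() == tj[1].lower()
        if len(ti) == 2 and ti[0].rstrip(".") in TITLES:
            return ti[1].lower() == tj[-1].lower()
        return False

    assert human_equivalent("John F Kennedy", "John Kennedy")
    assert human_equivalent("Mr Kennedy", "John Kennedy")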

1.4 Questions of category Location

Ai is equivalent to Aj if:

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,0, "(republic|city|kingdom|island)") and matches(Ai,1, "(of|in)")
  – Example: Republic of China is equivalent to China


• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai,1, "(of|in)")
  – Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai,2, Aj,0) and matches(Ai, "\sin\s+Aj,0")
  – Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj,0")
  – Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj,0")
  – Example:

Ai includes Aj if:

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai,0, Aj,2) and equalsic(Aj,1, ",")
  – Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai,0, Aj,3) and equalsic(Aj,2, ",")
  – Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai,0, Aj,1) and matchesic(Aj,0, "(south|west|north|east|central)[a-z]")
  – Example: California includes East California

1.5 Questions of category Entity

Ai is equivalent to Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
  – Example:

Ai includes Aj if:

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
  – Example: animal includes monkey
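The ishypernym predicate could be realised, for instance, over WordNet; the sketch below assumes NLTK with its WordNet corpus installed and is only one possible implementation:

    from nltk.corpus import wordnet as wn  # assumes the NLTK WordNet data is available

    def is_hypernym(ai, aj):
        """True when some sense of Ai appears among the transitive hypernyms of a
        sense of Aj, e.g. 'animal' is a hypernym of 'monkey'."""
        ai_senses = set(wn.synsets(ai))
        for sense in wn.synsets(aj):
            if ai_senses & set(sense.closure(lambda s: s.hypernyms())):
                return True
        return False

    def entity_includes(ai, aj):
        """Sketch of the Entity inclusion rule: short answers related by hypernymy."""
        return len(ai.split()) <= 2 and len(aj.split()) <= 2 and is_hypernym(ai, aj)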

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 51: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

a top answer in a couple of cases Moreover when it expects an answerof type NumericDate it could still easely choose the dateyear whenthe article was posted (or even the number of views of that article) Thesolution might pass for a) ignoring lsquoonersquo to lsquothreersquo when written in wordsb) exclude the publishing date from the snippet somehow and c) improvethe regular expressions used to filter candidates

ndash Filtering stage does not seem to work in certain questions ndash in some ques-tions we found that one or more of the final answers are actually withinthe question under certain variations For instance for the question lsquoWhois John J Famalaro accused of having killedrsquo John Famalaro is the topanswer retrieved The solution here should be applying a set of rules toremove these variations from the final answers

ndash Answers in different formats than expected ndash some questions imply mul-tiple answers but we are only searching for a single one Here is a goodexample the correct answer for the question lsquoWhen did Innsbruck hostthe Winter Olympicsrsquo is actually two dates as it is stated in the refer-ence file but the system only goes to retrieve one of them leading to aincorrect answer Should we put the multiple answers separated as singlereferences Other questions such as lsquoWhen did the journalist CharlotteBass liversquo demand a time interval and yet a single year is retrieved Whatto do in this case

ndash Difficulty tracing certain types of answer (category confusion) ndash for thequestion lsquoWhat position does Lothar Matthaus playrsquo the answer lsquoGer-manyrsquo appears on top In another example of not finding the right typeof answer the question lsquoWhat was the first film directed by John Miliusrsquois supposed to answer lsquoDillingerrsquo but chooses lsquowesternrsquo and lsquodocumentaryrsquoas answers instead We know that the question classifier is not 100 foulproof but maybe we should try to make a list of acceptable terms forcertain headwords And in the case of films and other works perhapstrace only those between ldquordquo if possible

ndash Incomplete references ndash the most easily detectable error to be correctedis the reference file that presents us all the possible lsquocorrectrsquo answerspotentially missing viable references

After the insertion of gaps such as this while also removing a contra-ditory reference or two the results improved greatly and the system nowmanages to answer correctly 96 of the 200 questions for an accuracy of480 and a precision of 552

51

6 Conclusions and Future Work

Over the course of this thesis we have experimented new ways of improvingthe Answers Module of JustAsk our own QA system

Here are the main contributions achieved

ndash We performed a state of art on relations between answersrsquo relations andparaphrase identification to detect new possible ways to improve JustAskrsquosresults with the Answer Extraction module

ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion

ndash We developed new metrics to improve the clustering of answers

ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats

ndash We developed new corpora that we believe will be extremely useful in thefuture

ndash We performed further analysis to detect persistent error patterns

In the end results were not as good as we could hope for but we still managedto improve the systemrsquos precision as high as 91 using a modified version ofthe Multisix corpus composed of 200 questions

There is still many work to be done in the future We understand thatthe problem here does not lie exclusively with the Answer module of JustAskndash what happens before has a severe impact over this module namely duringquery formulation and passage retrieval Over the many tasks we should tryhere are a selected few

ndash Develop a more semantic clusterer instead of merely lsquoortographicalrsquo byadding more rules that deal with the semantic of the answers

ndash Use the rules as a distance metric of its own

ndash Finally implementing solutions to the problems and questions unansweredover Section 54 can help improving somewhat these results

52

References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006

[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989

53

[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006

[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini Simone Romagnoli Alessandro Vallin Jesus HerreraAnselmo Penas Vıctor Peinado Felisa Verdejo and Maarten de Rijke Themultiple language question answering track at clef 2003 In CLEF volume3237 of Lecture Notes in Computer Science pages 471ndash486 Springer 2003

[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor ErbachAnselmo Pe nas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov andRichard F E Sutcliffe Overview of the clef 2004 multilingual questionanswering track In CLEF volume 3491 of Lecture Notes in ComputerScience pages 371ndash391 Springer 2004

54

[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence

[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006

[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995

[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006

[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970

[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal

[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009

[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools

55

[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005

56

Appendix

Appendix A ndash Rules for detecting relations between rules

57

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2 and considering

bull Aij ndash the jth token of candidate answer Ai

bull Aijk ndash the kth char of the jth token of candidate answer Ai

bull lemma(X) ndash the lemma of X

bull ishypernym(XY ) ndash returns true if X is the hypernym of Y

bull length(X) ndash the number of tokens of X

bull contains(X Y ) ndash returns true if Ai contains Aj

bull containsic(X Y ) ndash returns true if X contains Y case insensitive

bull equals(X Y ) ndash returns true if X is lexicographically equal to Y

bull equalsic(X Y ) ndash returns true if X is lexicographically equal to Y caseinsensitive

bull matches(XRegExp) ndash returns true if X matches the regular expressionRegExp

11 Questions of category Date

Ai includes Aj if

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 1)ndash Example M01 Y2001 includes D01 M01 Y2001

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 2)ndash Example Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

12 Questions of category Numeric

Given

bull number = (d+([0-9E]))

bull lessthan = ((less than)|(almost)|(up to)|(nearly))

bull morethan = ((more than)|(over)|(at least))

bull approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

bull max = 11 (or another value)

bull min = 09 (or another value)

bull numX ndash the numeric value in X

bull numXmax = numX max

bull numXmin = numX min

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj number) andmatches(Ai lsquolsquoAj( [a-z]+)rsquorsquo)ndash Example 200 people is equivalent to 200

bull matchesic(Ai lsquolsquo()bnumber( )rsquorsquo) andmatchesic(Aj lsquolsquo()bnumber( )rsquorsquo) and(numAimax gt numAj) and (numAimin lt numAj)ndash Example 210 people is equivalent to 200 people

Ai includes Aj if

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo)ndash Example approximately 200 people includes 210

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAi = numAj)ndash Example approximately 200 people includes 200

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAjmax gt numAi) and (numAjmin lt numAi)ndash Example approximately 200 people includes 210

1 Rules for detecting relations between answers 3

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bmorethan number( )rsquorsquo) and(numAi lt numAj)ndash Example more than 200 includes 300

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()blessthan number( )rsquorsquo) and(numAi gt numAj)ndash Example less than 300 includes 200

13 Questions of category Human

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai0Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to John Kennedy

bull (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai1Aj1) andmatches(Ai0lsquolsquo(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)rsquorsquo)ndash Example Mr Kennedy is equivalent to John Kennedy

bull (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai0Aj0) and equalsic(Ai2Aj2)and equalsic(Ai10Aj10)ndash Example John Fitzgerald Kennedy is equivalent to John F Kennedy

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai1Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to F Kennedy

131 Questions of category HumanIndividual

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai1Aj0)ndash Example Mr Kennedy is equivalent to Kennedy

bull containsic(AiAj) ndash Example

bull levenshteinDistance(AiAj) lt 1 ndash Example

14 Questions of category Location

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai0lsquolsquo(republic|city|kingdom|island)rsquorsquo) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example Republic of China is equivalent to China

1 Rules for detecting relations between answers 4

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ailsquolsquosins+Aj0rsquorsquo)ndash Example in China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) andmatches(Ailsquolsquo(republic|city|kingdom|island)s+(in|of)s+Aj0rsquorsquo)ndash Example in the lovely Republic of China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) and matches(Ailsquolsquo(in|of)s+Aj0rsquorsquo)ndash Example

Ai includes Aj if

bull (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai0Aj2) andequalsic(Aj1ldquordquo) andndash Example Florida includes Miami Florida

bull (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai0Aj3) andequalsic(Aj2ldquordquo) andndash Example California includes Los Angeles California

bull (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai0Aj1) andmatchesic(Aj0lsquolsquo(south|west|north|east|central)[a-z]rsquorsquo) andndash Example California includes East California

15 Questions of category Entity

Ai is equivalent to Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and equalsic(lemma(Ai)lemma(Aj))ndash Example

Ai includes Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and ishypernym(AiAj)ndash Example animal includes monkey

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 52: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

6 Conclusions and Future Work

Over the course of this thesis we have experimented new ways of improvingthe Answers Module of JustAsk our own QA system

Here are the main contributions achieved

ndash We performed a state of art on relations between answersrsquo relations andparaphrase identification to detect new possible ways to improve JustAskrsquosresults with the Answer Extraction module

ndash We implemented a new relations module that boosts the score of candidateanswers sharing relations of equivalence and inclusion

ndash We developed new metrics to improve the clustering of answers

ndash We developed new answer normalizers to relate two equal answers in dif-ferent formats

ndash We developed new corpora that we believe will be extremely useful in thefuture

ndash We performed further analysis to detect persistent error patterns

In the end results were not as good as we could hope for but we still managedto improve the systemrsquos precision as high as 91 using a modified version ofthe Multisix corpus composed of 200 questions

There is still many work to be done in the future We understand thatthe problem here does not lie exclusively with the Answer module of JustAskndash what happens before has a severe impact over this module namely duringquery formulation and passage retrieval Over the many tasks we should tryhere are a selected few

ndash Develop a more semantic clusterer instead of merely lsquoortographicalrsquo byadding more rules that deal with the semantic of the answers

ndash Use the rules as a distance metric of its own

ndash Finally implementing solutions to the problems and questions unansweredover Section 54 can help improving somewhat these results

52

References

[1] S Banerjee and T Pedersen Extended gloss overlaps as a measure ofsemantic relatedness In Proc of the 18th Intrsquol Joint Conf on ArtificialIntelligence pages 805ndash810 2003

[2] Regina Barzilay and Noemie Elhadad Sentence alignment for monolingualcomparable corpora In In Proceedings of the 2003 conference on Empiricalmethods in natural language processing pages 25ndash32 2003

[3] Regina Barzilay and Lillian Lee Learning to paraphrase An unsuper-vised approach using multiple-sequence alignment In Proceedings of theJoint Human Language TechnologyNorth American Chapter of the ACLConference (HLT-NAACL) pages 16ndash23 2003

[4] Marie catherine De Marneffe Bill Maccartney and Christopher D Man-ning Generating typed dependency parses from phrase structure parsesIn In LREC 2006 2006

[5] Joao Cordeiro Gael Dias and Pavel Brazdil Unsupervised learning ofparaphrases In In Research in Computer Science National PolytechnicInstitute Mexico ISSN pages 1870ndash4069 2007

[6] L Dice Measures of the amount of ecologic association between speciesEcology 26(3)297ndash302 1945

[7] Bill Dolan Chris Quirk and Chris Brockett Unsupervised constructionof large paraphrase corpora Exploiting massively parallel news sourcesIn In Proceedings of the 20th International Conference on ComputationalLinguistics 2004

[8] S Fernando and M Stevenson A semantic similarity approach to para-phrase detection In In CLUCK 2008 pages 18ndash26 2008

[9] Vasileios Hatzivassiloglou Judith L Klavans and Eleazar Eskin Detectingtext similarity over short passages Exploring linguistic feature combina-tions via machine learning 1999

[10] D Hays Dependency theory A formalism and some observations Lan-guages 40 pages 511ndash525 1964

[11] P Jaccard Distribution de la flore alpine dans le bassin des Dranses etdans quelques regions voisines Bulletin de la Societe Vaudoise des SciencesNaturelles 37241ndash272 1901

[12] M A Jaro Probabilistic linkage of large public health data file In Statisticsin Medicine volume 14 pages 491ndash498 1995

[13] Matthew A Jaro Advances in record-linkage methodology as applied tomatching the 1985 census of Tampa Florida Journal of the AmericanStatistical Association 84(406)414ndash420 1989

53

[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006

[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini Simone Romagnoli Alessandro Vallin Jesus HerreraAnselmo Penas Vıctor Peinado Felisa Verdejo and Maarten de Rijke Themultiple language question answering track at clef 2003 In CLEF volume3237 of Lecture Notes in Computer Science pages 471ndash486 Springer 2003

[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor ErbachAnselmo Pe nas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov andRichard F E Sutcliffe Overview of the clef 2004 multilingual questionanswering track In CLEF volume 3491 of Lecture Notes in ComputerScience pages 371ndash391 Springer 2004

54

[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence

[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006

[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995

[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006

[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970

[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal

[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009

[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools

55

[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005

56

Appendix

Appendix A ndash Rules for detecting relations between rules

57

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2 and considering

bull Aij ndash the jth token of candidate answer Ai

bull Aijk ndash the kth char of the jth token of candidate answer Ai

bull lemma(X) ndash the lemma of X

bull ishypernym(XY ) ndash returns true if X is the hypernym of Y

bull length(X) ndash the number of tokens of X

bull contains(X Y ) ndash returns true if Ai contains Aj

bull containsic(X Y ) ndash returns true if X contains Y case insensitive

bull equals(X Y ) ndash returns true if X is lexicographically equal to Y

bull equalsic(X Y ) ndash returns true if X is lexicographically equal to Y caseinsensitive

bull matches(XRegExp) ndash returns true if X matches the regular expressionRegExp

11 Questions of category Date

Ai includes Aj if

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 1)ndash Example M01 Y2001 includes D01 M01 Y2001

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 2)ndash Example Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

12 Questions of category Numeric

Given

bull number = (d+([0-9E]))

bull lessthan = ((less than)|(almost)|(up to)|(nearly))

bull morethan = ((more than)|(over)|(at least))

bull approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

bull max = 11 (or another value)

bull min = 09 (or another value)

bull numX ndash the numeric value in X

bull numXmax = numX max

bull numXmin = numX min

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj number) andmatches(Ai lsquolsquoAj( [a-z]+)rsquorsquo)ndash Example 200 people is equivalent to 200

bull matchesic(Ai lsquolsquo()bnumber( )rsquorsquo) andmatchesic(Aj lsquolsquo()bnumber( )rsquorsquo) and(numAimax gt numAj) and (numAimin lt numAj)ndash Example 210 people is equivalent to 200 people

Ai includes Aj if

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo)ndash Example approximately 200 people includes 210

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAi = numAj)ndash Example approximately 200 people includes 200

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAjmax gt numAi) and (numAjmin lt numAi)ndash Example approximately 200 people includes 210

1 Rules for detecting relations between answers 3

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bmorethan number( )rsquorsquo) and(numAi lt numAj)ndash Example more than 200 includes 300

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()blessthan number( )rsquorsquo) and(numAi gt numAj)ndash Example less than 300 includes 200

13 Questions of category Human

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai0Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to John Kennedy

bull (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai1Aj1) andmatches(Ai0lsquolsquo(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)rsquorsquo)ndash Example Mr Kennedy is equivalent to John Kennedy

bull (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai0Aj0) and equalsic(Ai2Aj2)and equalsic(Ai10Aj10)ndash Example John Fitzgerald Kennedy is equivalent to John F Kennedy

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai1Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to F Kennedy

131 Questions of category HumanIndividual

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai1Aj0)ndash Example Mr Kennedy is equivalent to Kennedy

bull containsic(AiAj) ndash Example

bull levenshteinDistance(AiAj) lt 1 ndash Example

14 Questions of category Location

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai0lsquolsquo(republic|city|kingdom|island)rsquorsquo) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example Republic of China is equivalent to China

1 Rules for detecting relations between answers 4

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2, Aj0) and matches(Ai1, "(of|in)")
– Example:

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai2, Aj0) and matches(Ai, "\s*in\s+Aj0")
– Example: in China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+Aj0")
– Example: in the lovely Republic of China is equivalent to China

• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+Aj0")
– Example:

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai0, Aj2) and equalsic(Aj1, ",")
– Example: Florida includes Miami, Florida

• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai0, Aj3) and equalsic(Aj2, ",")
– Example: California includes Los Angeles, California

• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai0, Aj1) and matchesic(Aj0, "(south|west|north|east|central)[a-z]*")
– Example: California includes East California
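The Location rules can be sketched as below. The descriptor and direction word lists come from the rules above; treating the comma-separated surface form "Miami, Florida" directly (instead of a separate comma token) and whitespace tokenisation are assumptions of the sketch.

import re

DESCRIPTOR = r"(republic|city|kingdom|island)"
DIRECTION  = r"(south|west|north|east|central)[a-z]*"

def location_equivalent(ai, aj):
    # "Republic of China", "in China", "in the lovely Republic of China" ~ "China".
    target = re.escape(aj.strip())
    patterns = [
        DESCRIPTOR + r"\s+(of|in)\s+" + target + r"$",
        r"(in|of)\s+" + target + r"$",
        r"^\s*in\s+" + target + r"$",
    ]
    return (any(re.search(p, ai, re.I) for p in patterns)
            or ai.strip().lower() == aj.strip().lower())

def location_includes(ai, aj):
    # "Florida" includes "Miami, Florida"; "California" includes "East California".
    parts = [p.strip() for p in aj.split(",")]
    if len(parts) == 2 and parts[1].lower() == ai.strip().lower():
        return True
    tj = aj.split()
    return (len(tj) == 2 and tj[1].lower() == ai.strip().lower()
            and re.fullmatch(DIRECTION, tj[0], re.I) is not None)

# Examples from the rules above:
assert location_equivalent("Republic of China", "China")
assert location_equivalent("in China", "China")
assert location_equivalent("in the lovely Republic of China", "China")
assert location_includes("Florida", "Miami, Florida")
assert location_includes("California", "Los Angeles, California")
assert location_includes("California", "East California")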

1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))
– Example:

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj)
– Example: animal includes monkey
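The Entity rules rely on lemma() and ishypernym(). The sketch below stands in for them with NLTK's WordNet interface, which is an assumption for illustration only and not necessarily the lexical resource used by JustAsk.

# Requires: pip install nltk; then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

_lemmatizer = WordNetLemmatizer()

def lemma(x):
    # lemma(X): lemmatise each token (noun part of speech assumed).
    return " ".join(_lemmatizer.lemmatize(t.lower()) for t in x.split())

def is_hypernym(x, y):
    # ishypernym(X, Y): true if some sense of X is an ancestor of some sense of Y in WordNet.
    for sy in wn.synsets(y):
        ancestors = set(sy.closure(lambda s: s.hypernyms()))
        if any(sx in ancestors for sx in wn.synsets(x)):
            return True
    return False

def entity_equivalent(ai, aj):
    return len(ai.split()) <= 2 and len(aj.split()) <= 2 and lemma(ai) == lemma(aj)

def entity_includes(ai, aj):
    return len(ai.split()) <= 2 and len(aj.split()) <= 2 and is_hypernym(ai, aj)

# Example from the rule above (given the WordNet noun hierarchy):
# entity_includes("animal", "monkey")  -> True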

References

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proc. of the 18th Int'l Joint Conf. on Artificial Intelligence, pages 805–810, 2003.

[2] Regina Barzilay and Noemie Elhadad. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 25–32, 2003.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 16–23, 2003.

[4] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

[5] João Cordeiro, Gaël Dias and Pavel Brazdil. Unsupervised learning of paraphrases. In Research in Computer Science, National Polytechnic Institute, Mexico, ISSN 1870-4069, 2007.

[6] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.

[7] Bill Dolan, Chris Quirk and Chris Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.

[8] S. Fernando and M. Stevenson. A semantic similarity approach to paraphrase detection. In CLUCK 2008, pages 18–26, 2008.

[9] Vasileios Hatzivassiloglou, Judith L. Klavans and Eleazar Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. 1999.

[10] D. Hays. Dependency theory: A formalism and some observations. Languages, 40, pages 511–525, 1964.

[11] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901.

[12] M. A. Jaro. Probabilistic linkage of large public health data file. In Statistics in Medicine, volume 14, pages 491–498, 1995.

[13] Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[14] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l Conf. on Research in Computational Linguistics, pages 19–33, 1997.

[15] Grzegorz Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers, Volume 2, NAACL-Short '03, pages 46–48. Association for Computational Linguistics, 2003.

[16] Zornitsa Kozareva, Sonia Vazquez and Andres Montoyo. The effect of semantic knowledge expansion to textual entailment recognition. In TSD, volume 4188 of Lecture Notes in Computer Science, pages 143–150. Springer, 2006.

[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction and representation of knowledge. Psychological Review, 104:211–240, 1997.

[18] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In MIT Press, pages 265–283, 1998.

[19] Vladimir Iosifovich Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Doklady Akademii Nauk SSSR, V163, No4, 845–848, 1965.

[20] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, 2002.

[21] Dekang Lin. Principle-based parsing without overgeneration. In ACL, pages 112–120, 1993.

[22] Jimmy J. Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2), 2007.

[23] Mihai C. Lintean and Vasile Rus. Paraphrase identification using weighted dependencies and word semantics. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009.

[24] Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo and Maarten de Rijke. The multiple language question answering track at CLEF 2003. In CLEF, volume 3237 of Lecture Notes in Computer Science, pages 471–486. Springer, 2003.

[25] Bernardo Magnini, Alessandro Vallin, Christelle Ayache, Gregor Erbach, Anselmo Peñas, Maarten de Rijke, Paulo Rocha, Kiril Ivanov Simov and Richard F. E. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In CLEF, volume 3491 of Lecture Notes in Computer Science, pages 371–391. Springer, 2004.

[26] A. Mendes and L. Coheur. An approach to answer selection in question answering based on semantic relations. 2011. Accepted for publication in IJCAI 2011, the 22nd International Joint Conference on Artificial Intelligence.

[27] Rada Mihalcea, Courtney Corley and Carlo Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI'06, pages 775–780, 2006.

[28] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[29] Véronique Moriceau. Numerical data integration for cooperative question-answering. 2006.

[30] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[31] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.

[32] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Computational Linguistics (ACL), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 311–318, 2002.

[33] Long Qiu, Min-Yen Kan and Tat-Seng Chua. Paraphrase recognition via dissimilarity significance classification. In EMNLP, pages 18–26. ACL, 2006.

[34] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of the 14th Int'l Joint Conf. on Artificial Intelligence (AAAI), pages 448–453, 1995.

[35] Joana Ribeiro. Final report in question-answer systems. 2010. Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal.

[36] R. C. Russell. Soundex phonetic comparison system [cf. U.S. Patents 1261167 (1918), 1435663 (1922)]. 1918.

[37] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[38] J. Silva. QA+ML@Wikipedia&Google. Master's thesis, Universidade Técnica de Lisboa – Instituto Superior Técnico, Portugal, 2009.

[39] J. Silva, L. Coheur and A. Mendes. Just ask – A multi-pronged approach to question answering. 2010. Submitted to Artificial Intelligence Tools.

[40] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

[41] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In ACL. The Association for Computer Linguistics, 2005.

[42] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.

[43] Ellen M. Voorhees. Overview of TREC 2003. In TREC, pages 1–13, 2003.

[44] Bonnie Webber, Claire Gardent and Johan Bos. Position statement: Inference in question answering. In Third International Conference on Language Resources and Evaluation – LREC'02, Las Palmas, Spain, 2002.

[45] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.

[46] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138, 1994.

[47] Yitao Zhang and Jon Patrick. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, 2005.

56

Appendix

Appendix A ndash Rules for detecting relations between rules

57

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2 and considering

bull Aij ndash the jth token of candidate answer Ai

bull Aijk ndash the kth char of the jth token of candidate answer Ai

bull lemma(X) ndash the lemma of X

bull ishypernym(XY ) ndash returns true if X is the hypernym of Y

bull length(X) ndash the number of tokens of X

bull contains(X Y ) ndash returns true if Ai contains Aj

bull containsic(X Y ) ndash returns true if X contains Y case insensitive

bull equals(X Y ) ndash returns true if X is lexicographically equal to Y

bull equalsic(X Y ) ndash returns true if X is lexicographically equal to Y caseinsensitive

bull matches(XRegExp) ndash returns true if X matches the regular expressionRegExp

11 Questions of category Date

Ai includes Aj if

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 1)ndash Example M01 Y2001 includes D01 M01 Y2001

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 2)ndash Example Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

12 Questions of category Numeric

Given

bull number = (d+([0-9E]))

bull lessthan = ((less than)|(almost)|(up to)|(nearly))

bull morethan = ((more than)|(over)|(at least))

bull approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

bull max = 11 (or another value)

bull min = 09 (or another value)

bull numX ndash the numeric value in X

bull numXmax = numX max

bull numXmin = numX min

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj number) andmatches(Ai lsquolsquoAj( [a-z]+)rsquorsquo)ndash Example 200 people is equivalent to 200

bull matchesic(Ai lsquolsquo()bnumber( )rsquorsquo) andmatchesic(Aj lsquolsquo()bnumber( )rsquorsquo) and(numAimax gt numAj) and (numAimin lt numAj)ndash Example 210 people is equivalent to 200 people

Ai includes Aj if

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo)ndash Example approximately 200 people includes 210

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAi = numAj)ndash Example approximately 200 people includes 200

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAjmax gt numAi) and (numAjmin lt numAi)ndash Example approximately 200 people includes 210

1 Rules for detecting relations between answers 3

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bmorethan number( )rsquorsquo) and(numAi lt numAj)ndash Example more than 200 includes 300

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()blessthan number( )rsquorsquo) and(numAi gt numAj)ndash Example less than 300 includes 200

13 Questions of category Human

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai0Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to John Kennedy

bull (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai1Aj1) andmatches(Ai0lsquolsquo(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)rsquorsquo)ndash Example Mr Kennedy is equivalent to John Kennedy

bull (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai0Aj0) and equalsic(Ai2Aj2)and equalsic(Ai10Aj10)ndash Example John Fitzgerald Kennedy is equivalent to John F Kennedy

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai1Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to F Kennedy

131 Questions of category HumanIndividual

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai1Aj0)ndash Example Mr Kennedy is equivalent to Kennedy

bull containsic(AiAj) ndash Example

bull levenshteinDistance(AiAj) lt 1 ndash Example

14 Questions of category Location

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai0lsquolsquo(republic|city|kingdom|island)rsquorsquo) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example Republic of China is equivalent to China

1 Rules for detecting relations between answers 4

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ailsquolsquosins+Aj0rsquorsquo)ndash Example in China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) andmatches(Ailsquolsquo(republic|city|kingdom|island)s+(in|of)s+Aj0rsquorsquo)ndash Example in the lovely Republic of China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) and matches(Ailsquolsquo(in|of)s+Aj0rsquorsquo)ndash Example

Ai includes Aj if

bull (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai0Aj2) andequalsic(Aj1ldquordquo) andndash Example Florida includes Miami Florida

bull (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai0Aj3) andequalsic(Aj2ldquordquo) andndash Example California includes Los Angeles California

bull (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai0Aj1) andmatchesic(Aj0lsquolsquo(south|west|north|east|central)[a-z]rsquorsquo) andndash Example California includes East California

15 Questions of category Entity

Ai is equivalent to Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and equalsic(lemma(Ai)lemma(Aj))ndash Example

Ai includes Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and ishypernym(AiAj)ndash Example animal includes monkey

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 54: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

[14] JJ Jiang and DW Conrath Semantic similarity based on corpus statis-tics and lexical taxonomy In Proc of the Intrsquol Conf on Research inComputational Linguistics pages 19ndash33 1997

[15] Grzegorz Kondrak Daniel Marcu and Kevin Knight Cognates can im-prove statistical translation models In Proceedings of the 2003 Confer-ence of the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology companion volume of the Pro-ceedings of HLT-NAACL 2003ndashshort papers - Volume 2 NAACL-Short rsquo03pages 46ndash48 Association for Computational Linguistics 2003

[16] Zornitsa Kozareva Sonia Vazquez and Andres Montoyo The effect of se-mantic knowledge expansion to textual entailment recognition In TSD vol-ume 4188 of Lecture Notes in Computer Science pages 143ndash150 Springer2006

[17] T K Landauer and S T Dumais A solution to platorsquos problem The latentsemantic analysis theory of the acquisition induction and representationof knowledge Psychological Review 104211ndash240 1997

[18] C Leacock and M Chodorow Combining local context and wordnet sim-ilarity for word sense identification In MIT Press pages 265ndash283 1998

[19] Vladimir Iosifovich Levenshtein Binary codes capable of correcting dele-tions insertions and reversals Soviet Physics Doklady 10(8)707ndash710 1966Doklady Akademii Nauk SSSR V163 No4 845-848 1965

[20] Xin Li and Dan Roth Learning question classifiers In Proceedings ofthe 19th international conference on Computational linguistics pages 1ndash7Association for Computational Linguistics 2002

[21] Dekang Lin Principle-based parsing without overgeneration In ACL pages112ndash120 1993

[22] Jimmy J Lin An exploration of the principles underlying redundancy-based factoid question answering ACM Trans Inf Syst 25(2) 2007

[23] Mihai C Lintean and Vasile Rus Paraphrase identification using weighteddependencies and word semantics In H Chad Lane and Hans W Guesgeneditors FLAIRS Conference AAAI Press 2009

[24] Bernardo Magnini Simone Romagnoli Alessandro Vallin Jesus HerreraAnselmo Penas Vıctor Peinado Felisa Verdejo and Maarten de Rijke Themultiple language question answering track at clef 2003 In CLEF volume3237 of Lecture Notes in Computer Science pages 471ndash486 Springer 2003

[25] Bernardo Magnini Alessandro Vallin Christelle Ayache Gregor ErbachAnselmo Pe nas Maarten de Rijke Paulo Rocha Kiril Ivanov Simov andRichard F E Sutcliffe Overview of the clef 2004 multilingual questionanswering track In CLEF volume 3491 of Lecture Notes in ComputerScience pages 371ndash391 Springer 2004

54

[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence

[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006

[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995

[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006

[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970

[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal

[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009

[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools

55

[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005

56

Appendix

Appendix A ndash Rules for detecting relations between rules

57

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2 and considering

bull Aij ndash the jth token of candidate answer Ai

bull Aijk ndash the kth char of the jth token of candidate answer Ai

bull lemma(X) ndash the lemma of X

bull ishypernym(XY ) ndash returns true if X is the hypernym of Y

bull length(X) ndash the number of tokens of X

bull contains(X Y ) ndash returns true if Ai contains Aj

bull containsic(X Y ) ndash returns true if X contains Y case insensitive

bull equals(X Y ) ndash returns true if X is lexicographically equal to Y

bull equalsic(X Y ) ndash returns true if X is lexicographically equal to Y caseinsensitive

bull matches(XRegExp) ndash returns true if X matches the regular expressionRegExp

11 Questions of category Date

Ai includes Aj if

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 1)ndash Example M01 Y2001 includes D01 M01 Y2001

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 2)ndash Example Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

12 Questions of category Numeric

Given

bull number = (d+([0-9E]))

bull lessthan = ((less than)|(almost)|(up to)|(nearly))

bull morethan = ((more than)|(over)|(at least))

bull approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

bull max = 11 (or another value)

bull min = 09 (or another value)

bull numX ndash the numeric value in X

bull numXmax = numX max

bull numXmin = numX min

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj number) andmatches(Ai lsquolsquoAj( [a-z]+)rsquorsquo)ndash Example 200 people is equivalent to 200

bull matchesic(Ai lsquolsquo()bnumber( )rsquorsquo) andmatchesic(Aj lsquolsquo()bnumber( )rsquorsquo) and(numAimax gt numAj) and (numAimin lt numAj)ndash Example 210 people is equivalent to 200 people

Ai includes Aj if

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo)ndash Example approximately 200 people includes 210

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAi = numAj)ndash Example approximately 200 people includes 200

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAjmax gt numAi) and (numAjmin lt numAi)ndash Example approximately 200 people includes 210

1 Rules for detecting relations between answers 3

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bmorethan number( )rsquorsquo) and(numAi lt numAj)ndash Example more than 200 includes 300

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()blessthan number( )rsquorsquo) and(numAi gt numAj)ndash Example less than 300 includes 200

13 Questions of category Human

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai0Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to John Kennedy

bull (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai1Aj1) andmatches(Ai0lsquolsquo(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)rsquorsquo)ndash Example Mr Kennedy is equivalent to John Kennedy

bull (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai0Aj0) and equalsic(Ai2Aj2)and equalsic(Ai10Aj10)ndash Example John Fitzgerald Kennedy is equivalent to John F Kennedy

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai1Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to F Kennedy

131 Questions of category HumanIndividual

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai1Aj0)ndash Example Mr Kennedy is equivalent to Kennedy

bull containsic(AiAj) ndash Example

bull levenshteinDistance(AiAj) lt 1 ndash Example

14 Questions of category Location

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai0lsquolsquo(republic|city|kingdom|island)rsquorsquo) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example Republic of China is equivalent to China

1 Rules for detecting relations between answers 4

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ailsquolsquosins+Aj0rsquorsquo)ndash Example in China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) andmatches(Ailsquolsquo(republic|city|kingdom|island)s+(in|of)s+Aj0rsquorsquo)ndash Example in the lovely Republic of China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) and matches(Ailsquolsquo(in|of)s+Aj0rsquorsquo)ndash Example

Ai includes Aj if

bull (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai0Aj2) andequalsic(Aj1ldquordquo) andndash Example Florida includes Miami Florida

bull (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai0Aj3) andequalsic(Aj2ldquordquo) andndash Example California includes Los Angeles California

bull (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai0Aj1) andmatchesic(Aj0lsquolsquo(south|west|north|east|central)[a-z]rsquorsquo) andndash Example California includes East California

15 Questions of category Entity

Ai is equivalent to Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and equalsic(lemma(Ai)lemma(Aj))ndash Example

Ai includes Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and ishypernym(AiAj)ndash Example animal includes monkey

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 55: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

[26] A Mendes and L Coheur An approach to answer selection in questionanswering based on semantic relations 2011 Accepted for publication inIJCAI 2011 the 22nd International Joint Conference on Artificial Intelli-gence

[27] Rada Mihalcea Courtney Corley and Carlo Strapparava Corpus-basedand knowledge-based measures of text semantic similarity In In AAAIrsquo06pages 775ndash780 2006

[28] George A Miller Wordnet A lexical database for english Communicationsof the ACM 3839ndash41 1995

[29] Veronique Moriceau Numerical data integration for cooperative question-answering 2006

[30] Saul B Needleman and Christian D Wunsch A general method applicableto the search for similarities in the amino acid sequence of two proteinsJournal of Molecular Biology 48(3)443ndash453 1970

[31] Franz Josef Och and Hermann Ney A systematic comparison of variousstatistical alignment models Computational Linguistics 29 2003

[32] Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu Bleu amethod for automatic evaluation of machine translation In ComputationalLinguistics (ACL) Proceedings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics (ACL) Philadelphia pages 311ndash3182002

[33] Long Qiu Min-Yen Kan and Tat-Seng Chua Paraphrase recognition viadissimilarity significance classification In EMNLP pages 18ndash26 ACL2006

[34] Philip Resnik Using information content to evaluate semantic similarity ina taxonomy In Proc of the 14th Intrsquol Joint Conf on Artificial Intelligence(AAAI) pages 448ndash453 1995

[35] Joana Ribeiro Final report in question-answer systems 2010 UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal

[36] R C Russell Soundex phonetic comparison system [cfus patents 1261167(1918) 1435663 (1922)] 1918

[37] Gerard Salton and Christopher Buckley Term-weighting approaches inautomatic text retrieval In INFORMATION PROCESSING AND MAN-AGEMENT pages 513ndash523 1988

[38] J Silva QA+MLWikipediaampGoogle Masterrsquos thesis UniversidadeTecnica de Lisboa ndash Instituto Superior Tecnico Portugal 2009

[39] J Silva L Coheur and A Mendes Just askndashA multi-pronged approachto question answering 2010 Submitted to Artificial Intelligence Tools

55

[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005

56

Appendix

Appendix A ndash Rules for detecting relations between rules

57

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2 and considering

bull Aij ndash the jth token of candidate answer Ai

bull Aijk ndash the kth char of the jth token of candidate answer Ai

bull lemma(X) ndash the lemma of X

bull ishypernym(XY ) ndash returns true if X is the hypernym of Y

bull length(X) ndash the number of tokens of X

bull contains(X Y ) ndash returns true if Ai contains Aj

bull containsic(X Y ) ndash returns true if X contains Y case insensitive

bull equals(X Y ) ndash returns true if X is lexicographically equal to Y

bull equalsic(X Y ) ndash returns true if X is lexicographically equal to Y caseinsensitive

bull matches(XRegExp) ndash returns true if X matches the regular expressionRegExp

11 Questions of category Date

Ai includes Aj if

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 1)ndash Example M01 Y2001 includes D01 M01 Y2001

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 2)ndash Example Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

12 Questions of category Numeric

Given

bull number = (d+([0-9E]))

bull lessthan = ((less than)|(almost)|(up to)|(nearly))

bull morethan = ((more than)|(over)|(at least))

bull approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

bull max = 11 (or another value)

bull min = 09 (or another value)

bull numX ndash the numeric value in X

bull numXmax = numX max

bull numXmin = numX min

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj number) andmatches(Ai lsquolsquoAj( [a-z]+)rsquorsquo)ndash Example 200 people is equivalent to 200

bull matchesic(Ai lsquolsquo()bnumber( )rsquorsquo) andmatchesic(Aj lsquolsquo()bnumber( )rsquorsquo) and(numAimax gt numAj) and (numAimin lt numAj)ndash Example 210 people is equivalent to 200 people

Ai includes Aj if

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo)ndash Example approximately 200 people includes 210

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAi = numAj)ndash Example approximately 200 people includes 200

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAjmax gt numAi) and (numAjmin lt numAi)ndash Example approximately 200 people includes 210

1 Rules for detecting relations between answers 3

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bmorethan number( )rsquorsquo) and(numAi lt numAj)ndash Example more than 200 includes 300

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()blessthan number( )rsquorsquo) and(numAi gt numAj)ndash Example less than 300 includes 200

13 Questions of category Human

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai0Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to John Kennedy

bull (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai1Aj1) andmatches(Ai0lsquolsquo(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)rsquorsquo)ndash Example Mr Kennedy is equivalent to John Kennedy

bull (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai0Aj0) and equalsic(Ai2Aj2)and equalsic(Ai10Aj10)ndash Example John Fitzgerald Kennedy is equivalent to John F Kennedy

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai1Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to F Kennedy

131 Questions of category HumanIndividual

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai1Aj0)ndash Example Mr Kennedy is equivalent to Kennedy

bull containsic(AiAj) ndash Example

bull levenshteinDistance(AiAj) lt 1 ndash Example

14 Questions of category Location

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai0lsquolsquo(republic|city|kingdom|island)rsquorsquo) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example Republic of China is equivalent to China

1 Rules for detecting relations between answers 4

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ailsquolsquosins+Aj0rsquorsquo)ndash Example in China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) andmatches(Ailsquolsquo(republic|city|kingdom|island)s+(in|of)s+Aj0rsquorsquo)ndash Example in the lovely Republic of China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) and matches(Ailsquolsquo(in|of)s+Aj0rsquorsquo)ndash Example

Ai includes Aj if

bull (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai0Aj2) andequalsic(Aj1ldquordquo) andndash Example Florida includes Miami Florida

bull (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai0Aj3) andequalsic(Aj2ldquordquo) andndash Example California includes Los Angeles California

bull (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai0Aj1) andmatchesic(Aj0lsquolsquo(south|west|north|east|central)[a-z]rsquorsquo) andndash Example California includes East California

15 Questions of category Entity

Ai is equivalent to Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and equalsic(lemma(Ai)lemma(Aj))ndash Example

Ai includes Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and ishypernym(AiAj)ndash Example animal includes monkey

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 56: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

[40] T Smith and M Waterman Identification of common molecular subse-quences Journal of Molecular Biology 147(1)195ndash197 1981

[41] Mark Stevenson and Mark A Greenwood A semantic approach to IEpattern induction In ACL The Association for Computer Linguistics2005

[42] Peter D Turney Mining the web for synonyms PMI-IR versus LSA onTOEFL In EMCL rsquo01 Proceedings of the 12th European Conference onMachine Learning pages 491ndash502 Springer-Verlag 2001

[43] Ellen M Voorhees Overview of TREC 2003 In TREC pages 1ndash13 2003

[44] Bonnie Webber Claire Gardent and Johan Bos Position statement Infer-ence in question answering In Third international conference on LanguageResources and Evaluation - LRECrsquo02 Las Palmas Spain 2002

[45] William E Winkler String comparator metrics and enhanced decision rulesin the fellegi-sunter model of record linkage In Proceedings of the Sectionon Survey Research pages 354ndash359 1990

[46] Zhibiao Wu and Martha Palmer Verb semantics and lexical selectionIn Proc of the 32nd annual meeting on Association for ComputationalLinguistics pages 133ndash138 1994

[47] Yitao Zhang and Jon Patrick Paraphrase identification by text canonical-ization In Proceedings of the Australasian Language Technology Workshop2005 pages 160ndash166 Sydney Australia 2005

56

Appendix

Appendix A ndash Rules for detecting relations between rules

57

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2 and considering

bull Aij ndash the jth token of candidate answer Ai

bull Aijk ndash the kth char of the jth token of candidate answer Ai

bull lemma(X) ndash the lemma of X

bull ishypernym(XY ) ndash returns true if X is the hypernym of Y

bull length(X) ndash the number of tokens of X

bull contains(X Y ) ndash returns true if Ai contains Aj

bull containsic(X Y ) ndash returns true if X contains Y case insensitive

bull equals(X Y ) ndash returns true if X is lexicographically equal to Y

bull equalsic(X Y ) ndash returns true if X is lexicographically equal to Y caseinsensitive

bull matches(XRegExp) ndash returns true if X matches the regular expressionRegExp

11 Questions of category Date

Ai includes Aj if

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 1)ndash Example M01 Y2001 includes D01 M01 Y2001

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 2)ndash Example Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

12 Questions of category Numeric

Given

bull number = (d+([0-9E]))

bull lessthan = ((less than)|(almost)|(up to)|(nearly))

bull morethan = ((more than)|(over)|(at least))

bull approx = ((approx[a-z])|(around)|(roughly)|(about)|(some))

and

bull max = 11 (or another value)

bull min = 09 (or another value)

bull numX ndash the numeric value in X

bull numXmax = numX max

bull numXmin = numX min

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj number) andmatches(Ai lsquolsquoAj( [a-z]+)rsquorsquo)ndash Example 200 people is equivalent to 200

bull matchesic(Ai lsquolsquo()bnumber( )rsquorsquo) andmatchesic(Aj lsquolsquo()bnumber( )rsquorsquo) and(numAimax gt numAj) and (numAimin lt numAj)ndash Example 210 people is equivalent to 200 people

Ai includes Aj if

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo)ndash Example approximately 200 people includes 210

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAi = numAj)ndash Example approximately 200 people includes 200

bull (length(Aj) = 1) and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bapprox number( )rsquorsquo) and(numAjmax gt numAi) and (numAjmin lt numAi)ndash Example approximately 200 people includes 210

1 Rules for detecting relations between answers 3

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()bmorethan number( )rsquorsquo) and(numAi lt numAj)ndash Example more than 200 includes 300

bull length(Aj) = 1 and matchesic(Aj number) andmatchesic(Ai lsquolsquo()blessthan number( )rsquorsquo) and(numAi gt numAj)ndash Example less than 300 includes 200

13 Questions of category Human

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai0Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to John Kennedy

bull (length(Ai) = 2) and (length(Aj) = 3) and equalsic(Ai1Aj1) andmatches(Ai0lsquolsquo(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)rsquorsquo)ndash Example Mr Kennedy is equivalent to John Kennedy

bull (length(Ai) = 3) and (length(Aj) = 3) and equalsic(Ai0Aj0) and equalsic(Ai2Aj2)and equalsic(Ai10Aj10)ndash Example John Fitzgerald Kennedy is equivalent to John F Kennedy

bull (length(Ai) = 3) and (length(Aj) = 2) and equalsic(Ai1Aj0) and equalsic(Ai2Aj1)ndash Example John F Kennedy is equivalent to F Kennedy

131 Questions of category HumanIndividual

Ai is equivalent to Aj if

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai1Aj0)ndash Example Mr Kennedy is equivalent to Kennedy

bull containsic(AiAj) ndash Example

bull levenshteinDistance(AiAj) lt 1 ndash Example

14 Questions of category Location

Ai is equivalent to Aj if

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai0lsquolsquo(republic|city|kingdom|island)rsquorsquo) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example Republic of China is equivalent to China

1 Rules for detecting relations between answers 4

bull (length(Ai) = 3) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ai1lsquolsquo(of|in)rsquorsquo)ndash Example

bull (length(Ai) = 2) and (length(Aj) = 1) and equalsic(Ai2Aj0) andmatches(Ailsquolsquosins+Aj0rsquorsquo)ndash Example in China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) andmatches(Ailsquolsquo(republic|city|kingdom|island)s+(in|of)s+Aj0rsquorsquo)ndash Example in the lovely Republic of China is equivalent to China

bull (length(Ai) gt= 2) and (length(Aj) = 1) and matches(Ailsquolsquo(in|of)s+Aj0rsquorsquo)ndash Example

Ai includes Aj if

bull (length(Ai) = 1) and (length(Aj) = 3) and equalsic(Ai0Aj2) andequalsic(Aj1ldquordquo) andndash Example Florida includes Miami Florida

bull (length(Ai) = 1) and (length(Aj) = 4) and equalsic(Ai0Aj3) andequalsic(Aj2ldquordquo) andndash Example California includes Los Angeles California

bull (length(Ai) = 1) and (length(Aj) = 2) and equalsic(Ai0Aj1) andmatchesic(Aj0lsquolsquo(south|west|north|east|central)[a-z]rsquorsquo) andndash Example California includes East California

15 Questions of category Entity

Ai is equivalent to Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and equalsic(lemma(Ai)lemma(Aj))ndash Example

Ai includes Aj if

bull (length(Ai) lt= 2) and (length(Aj) lt= 2) and ishypernym(AiAj)ndash Example animal includes monkey

  • Introduction
  • Related Work
    • Relating Answers
    • Paraphrase Identification
      • SimFinder ndash A methodology based on linguistic analysis of comparable corpora
      • Methodologies purely based on lexical similarity
      • Canonicalization as complement to lexical features
      • Use of context to extract similar phrases
      • Use of dissimilarities and semantics to provide better similarity measures
        • Overview
          • JustAsk architecture and work points
            • JustAsk
              • Question Interpretation
              • Passage Retrieval
              • Answer Extraction
                • Experimental Setup
                  • Evaluation measures
                  • Multisix corpus
                  • TREC corpus
                    • Baseline
                      • Results with Multisix Corpus
                      • Results with TREC Corpus
                        • Error Analysis over the Answer Module
                          • Relating Answers
                            • Improving the Answer Extraction module Relations Module Implementation
                            • Additional Features
                              • Normalizers
                              • Proximity Measure
                                • Results
                                • Results Analysis
                                • Building an Evaluation Corpus
                                  • Clustering Answers
                                    • Testing Different Similarity Metrics
                                      • Clusterer Metrics
                                      • Results
                                      • Results Analysis
                                        • New Distance Metric - JFKDistance
                                          • Description
                                          • Results
                                          • Results Analysis
                                            • Possible new features ndash Combination of distance metrics to enhance overall results
                                            • Posterior analysis
                                              • Conclusions and Future Work
Page 57: The Answer Question in Question Answering Systems · chitecture: Question Interpretation, Passage Retrieval and Answer Extraction. Unfortunately, the Answer Extraction module is not

Appendix

Appendix A ndash Rules for detecting relations between rules

57

Relations between answers

1 Rules for detecting relations between answers

This document specifies the rules used to detect relations of equivalence andinclusion between two candidate answers

Assuming candidate answers A1 and A2 and considering

bull Aij ndash the jth token of candidate answer Ai

bull Aijk ndash the kth char of the jth token of candidate answer Ai

bull lemma(X) ndash the lemma of X

bull ishypernym(XY ) ndash returns true if X is the hypernym of Y

bull length(X) ndash the number of tokens of X

bull contains(X Y ) ndash returns true if Ai contains Aj

bull containsic(X Y ) ndash returns true if X contains Y case insensitive

bull equals(X Y ) ndash returns true if X is lexicographically equal to Y

bull equalsic(X Y ) ndash returns true if X is lexicographically equal to Y caseinsensitive

bull matches(XRegExp) ndash returns true if X matches the regular expressionRegExp

11 Questions of category Date

Ai includes Aj if

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 1)ndash Example M01 Y2001 includes D01 M01 Y2001

bull contains(Aj Ai) and ((length(Aj) - length(Ai)) = 2)ndash Example Y2001 includes D01 M01 Y2001

1

1 Rules for detecting relations between answers 2

12 Questions of category Numeric

Given

• number = (\d+([0-9,.E]*))
• lessthan = ((less than)|(almost)|(up to)|(nearly))
• morethan = ((more than)|(over)|(at least))
• approx = ((approx[a-z]*)|(around)|(roughly)|(about)|(some))

and

• max = 1.1 (or another value)
• min = 0.9 (or another value)
• numX – the numeric value in X
• numX_max = numX × max
• numX_min = numX × min

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and matches(Aj, number) and matches(Ai, "Aj( [a-z]+)") – Example: 200 people is equivalent to 200
• matchesic(Ai, "(.*)\bnumber(.*)") and matchesic(Aj, "(.*)\bnumber(.*)") and (numAi_max > numAj) and (numAi_min < numAj) – Example: 210 people is equivalent to 200 people

Ai includes Aj if

• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") – Example: approximately 200 people includes 210
• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAi = numAj) – Example: approximately 200 people includes 200
• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bapprox number(.*)") and (numAj_max > numAi) and (numAj_min < numAi) – Example: approximately 200 people includes 210
• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\bmorethan number(.*)") and (numAi < numAj) – Example: more than 200 includes 300
• (length(Aj) = 1) and matchesic(Aj, number) and matchesic(Ai, "(.*)\blessthan number(.*)") and (numAi > numAj) – Example: less than 300 includes 200

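Taken together, the Numeric rules treat two values as equivalent when they lie within roughly 10% of each other, and use the lessthan/morethan/approx expressions to turn qualified answers into one-sided or widened checks. A condensed sketch, reusing the helpers above; num(), the constants and the exact macro spellings are reconstructions for illustration, not the thesis code:

    import re

    NUMBER   = r"\d+[0-9,.E]*"                      # assumed form of the 'number' macro
    LESSTHAN = r"(less than|almost|up to|nearly)"
    MORETHAN = r"(more than|over|at least)"
    APPROX   = r"(approx[a-z]*|around|roughly|about|some)"
    MAX_FACTOR, MIN_FACTOR = 1.1, 0.9

    def num(x):
        """Numeric value in X, e.g. '210 people' -> 210.0."""
        m = re.search(r"\d+(\.\d+)?", x.replace(",", ""))
        return float(m.group()) if m else None

    def numeric_equivalent(ai, aj):
        """E.g. '210 people' is equivalent to '200 people' (within the +/- 10% band)."""
        if matchesic(ai, NUMBER) and matchesic(aj, NUMBER):
            return num(ai) * MIN_FACTOR < num(aj) < num(ai) * MAX_FACTOR
        return False

    def numeric_includes(ai, aj):
        """E.g. 'more than 200' includes '300', 'less than 300' includes '200',
        'approximately 200 people' includes '210'."""
        if length(aj) != 1 or not matchesic(aj, NUMBER):
            return False
        if matchesic(ai, rf"\b{APPROX}\s+{NUMBER}"):
            return num(ai) == num(aj) or num(aj) * MIN_FACTOR < num(ai) < num(aj) * MAX_FACTOR
        if matchesic(ai, rf"\b{MORETHAN}\s+{NUMBER}"):
            return num(ai) < num(aj)
        if matchesic(ai, rf"\b{LESSTHAN}\s+{NUMBER}"):
            return num(ai) > num(aj)
        return False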
1.3 Questions of category Human

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(A_{i,0}, A_{j,0}) and equalsic(A_{i,2}, A_{j,1}) – Example: John F. Kennedy is equivalent to John Kennedy
• (length(Ai) = 2) and (length(Aj) = 3) and equalsic(A_{i,1}, A_{j,1}) and matches(A_{i,0}, "(Mrs)|(Dr)|(Mister)|(Miss)|(Doctor)|(Lady)|(Madame)") – Example: Mr. Kennedy is equivalent to John Kennedy
• (length(Ai) = 3) and (length(Aj) = 3) and equalsic(A_{i,0}, A_{j,0}) and equalsic(A_{i,2}, A_{j,2}) and equalsic(A_{i,1,0}, A_{j,1,0}) – Example: John Fitzgerald Kennedy is equivalent to John F. Kennedy
• (length(Ai) = 3) and (length(Aj) = 2) and equalsic(A_{i,1}, A_{j,0}) and equalsic(A_{i,2}, A_{j,1}) – Example: John F. Kennedy is equivalent to F. Kennedy

1.3.1 Questions of category HumanIndividual

Ai is equivalent to Aj if

• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(A_{i,1}, A_{j,0}) – Example: Mr. Kennedy is equivalent to Kennedy
• containsic(Ai, Aj)
• levenshteinDistance(Ai, Aj) < 1

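The Human rules essentially align first, middle and last name tokens, allowing a middle name to be reduced to its initial or dropped and, in the HumanIndividual case, allowing a title plus surname or simple containment. A partial sketch of these alignments, reusing the helpers above (the title-based rule and the Levenshtein fallback are left out):

    def human_equivalent(ai, aj):
        """Partial sketch of the Human equivalence rules (name-token alignment)."""
        ti, tj = tokens(ai), tokens(aj)
        if len(ti) == 3 and len(tj) == 2:
            # 'John F. Kennedy' ~ 'John Kennedy'  or  'John F. Kennedy' ~ 'F. Kennedy'
            return (equalsic(ti[0], tj[0]) or equalsic(ti[1], tj[0])) and equalsic(ti[2], tj[1])
        if len(ti) == 3 and len(tj) == 3:
            # 'John Fitzgerald Kennedy' ~ 'John F. Kennedy': middle names compared by initial only
            return (equalsic(ti[0], tj[0]) and equalsic(ti[2], tj[2])
                    and equalsic(ti[1][0], tj[1][0]))
        return False

    def human_individual_equivalent(ai, aj):
        """HumanIndividual adds looser fallbacks, e.g. 'Mr. Kennedy' ~ 'Kennedy'."""
        ti, tj = tokens(ai), tokens(aj)
        if len(ti) == 2 and len(tj) == 1 and equalsic(ti[1], tj[0]):
            return True
        return containsic(ai, aj)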
1.4 Questions of category Location

Ai is equivalent to Aj if

• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(A_{i,2}, A_{j,0}) and matches(A_{i,0}, "(republic|city|kingdom|island)") and matches(A_{i,1}, "(of|in)") – Example: Republic of China is equivalent to China
• (length(Ai) = 3) and (length(Aj) = 1) and equalsic(A_{i,2}, A_{j,0}) and matches(A_{i,1}, "(of|in)")
• (length(Ai) = 2) and (length(Aj) = 1) and equalsic(A_{i,1}, A_{j,0}) and matches(Ai, "\sin\s+A_{j,0}") – Example: in China is equivalent to China
• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(republic|city|kingdom|island)\s+(in|of)\s+A_{j,0}") – Example: in the lovely Republic of China is equivalent to China
• (length(Ai) >= 2) and (length(Aj) = 1) and matches(Ai, "(in|of)\s+A_{j,0}")

Ai includes Aj if

• (length(Ai) = 1) and (length(Aj) = 3) and equalsic(A_{i,0}, A_{j,2}) and equalsic(A_{j,1}, ",") – Example: Florida includes Miami, Florida
• (length(Ai) = 1) and (length(Aj) = 4) and equalsic(A_{i,0}, A_{j,3}) and equalsic(A_{j,2}, ",") – Example: California includes Los Angeles, California
• (length(Ai) = 1) and (length(Aj) = 2) and equalsic(A_{i,0}, A_{j,1}) and matchesic(A_{j,0}, "(south|west|north|east|central)[a-z]*") – Example: California includes East California

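For locations, equivalence strips geographic type words and prepositions ("Republic of", "in"), while inclusion recognizes "<city> , <state>" patterns (note that the comma is a token of its own) and directional modifiers. An illustrative sketch, reusing the helpers above:

    import re

    GEO_TYPE  = r"(republic|city|kingdom|island)"
    DIRECTION = r"(south|west|north|east|central)[a-z]*"

    def location_equivalent(ai, aj):
        """E.g. 'Republic of China' ~ 'China', 'in China' ~ 'China'."""
        tj = tokens(aj)
        if len(tj) != 1 or length(ai) < 2:
            return False
        place = re.escape(tj[0])
        return matchesic(ai, rf"({GEO_TYPE}\s+)?(in|of)\s+{place}$")

    def location_includes(ai, aj):
        """E.g. 'Florida' includes 'Miami , Florida'; 'California' includes 'East California'."""
        ti, tj = tokens(ai), tokens(aj)
        if len(ti) != 1:
            return False
        if len(tj) in (3, 4) and equalsic(tj[-2], ",") and equalsic(ti[0], tj[-1]):
            return True   # '<city> , <state>', with the comma as its own token
        if len(tj) == 2 and equalsic(ti[0], tj[1]) and matchesic(tj[0], DIRECTION):
            return True   # directional modifier on the same place name
        return False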
1.5 Questions of category Entity

Ai is equivalent to Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and equalsic(lemma(Ai), lemma(Aj))

Ai includes Aj if

• (length(Ai) <= 2) and (length(Aj) <= 2) and ishypernym(Ai, Aj) – Example: animal includes monkey

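The Entity rules depend on lemmatization and on a hypernymy test between the two (short) answers. The sketch below shows one way lemma(X) and ishypernym(X, Y) could be realized with WordNet through NLTK; this is an assumption made for illustration (the thesis does not necessarily use NLTK), and the at-most-two-token guard is left to the caller:

    # Requires NLTK with the WordNet corpus downloaded (nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn
    from nltk.stem import WordNetLemmatizer

    _lemmatizer = WordNetLemmatizer()

    def lemma(x):
        """Lemma of a short answer, lemmatizing each token as a noun."""
        return " ".join(_lemmatizer.lemmatize(t.lower()) for t in x.split())

    def ishypernym(x, y):
        """True if some noun sense of x is a (direct or inherited) hypernym of some sense of y,
        e.g. ishypernym('animal', 'monkey') -> True."""
        x_synsets = set(wn.synsets(x, pos=wn.NOUN))
        for syn in wn.synsets(y, pos=wn.NOUN):
            # closure() walks upwards through the hypernym hierarchy of this sense of y
            if x_synsets & set(syn.closure(lambda s: s.hypernyms())):
                return True
        return False

    def entity_equivalent(ai, aj):
        return lemma(ai) == lemma(aj)

    def entity_includes(ai, aj):
        return ishypernym(ai, aj)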