
  • Exploring the Performance of Boolean Retrieval Strategies for Open Domain Question Answering
    IR4QA: Information Retrieval for Question Answering, SIGIR 2004 Workshop
    Horacio Saggion, Robert Gaizauskas, Mark Hepple, Ian Roberts, Mark A. Greenwood (University of Sheffield, UK)

  • Abstract
    Exploring and evaluating Boolean retrieval strategies for QA.
    Evaluation metrics: coverage and redundancy.
    A series of possible Boolean retrieval strategies for use in QA; evaluate their performance.
    Understanding which query formulations are better for QA.

  • Introduction (1 of 2)
    Open domain question answering.
    A QA system's performance is bounded by its IR component.
    Pipeline: Question -> IR -> Answer Extraction -> Answer.

  • Introduction (2 of 2)
    Present results for off-the-shelf ranked retrieval engines (baseline: no answer extraction).
    Describe previous work that has used a Boolean search strategy in QA.
    Then: experiments, results, discussion.

  • Evaluation Measures (1 of 4)
    Coverage: the proportion of the question set for which a correct answer can be found within the top n passages retrieved for each question.
    Answer redundancy: the average number of passages within the top n ranks retrieved which contain a correct answer, per question.
    In other words, how many chances per question, on average, an answer extraction component has to find an answer.

  • Evaluation Measures (2 of 4)
    Q: the question set.
    D: the document collection (corpus).
    n: the top n ranks considered.
    A_{D,q}: the subset of D which contains correct answers for q ∈ Q.
    R^S_{D,q,n}: the n top-ranked documents (or passages) in D retrieved by a retrieval system S given question q.
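
    Given these definitions, the two measures can be written out as follows (a reconstruction from the notation on this slide, not copied from the paper):

      \mathrm{coverage}^{S}(Q, D, n) =
        \frac{\big|\{\, q \in Q \mid R^{S}_{D,q,n} \cap A_{D,q} \neq \emptyset \,\}\big|}{|Q|}

      \mathrm{redundancy}^{S}(Q, D, n) =
        \frac{\sum_{q \in Q} \big| R^{S}_{D,q,n} \cap A_{D,q} \big|}{|Q|}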

  • Evaluation Measures (3 of 4)
    [Figure: the corpus D, divided into documents with an answer (A_{D,q}) and documents without an answer, with the retrieved set R^S_{D,q,n} overlapping both.]

  • Evaluation Measures (4 of 4)
    The actual redundancy is the upper bound on the answer redundancy achievable by any QA system.
    Comparing answer redundancy with actual redundancy captures the same information that recall traditionally supplies.

  • Evaluation Dataset (1 of 3)
    Text collection: the AQUAINT corpus, around 1,000,000 newswire documents from 1998-2000, 3.2 GB.
    TREC 2003 question set: 500 questions; we use only the 413 factoid questions, not the list or definition questions.
    51 of these questions were judged by human assessors to have no answer in the collection.
    Excluding them leaves a question set of 362 questions.
    NIST provides regular-expression answer patterns for each question, which match strings that contain the answer.

  • Evaluation Dataset (2 of 3)
    There are two criteria for correctness.
    Lenient: any string drawn from the test collection which matches an answer pattern for the question is counted as correct.
    Strict: a string matching an answer pattern must in addition be drawn from a text which has been judged by a human assessor to support the answer.
    (a small checking sketch follows below)
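
    A minimal sketch of how the two criteria might be checked, assuming a hypothetical list of answer-pattern regexes per question and a hypothetical set of document IDs judged by assessors to support the answer:

      import re

      def is_correct(candidate, doc_id, answer_patterns, supporting_docs, strict=False):
          """Check a candidate answer string against NIST-style answer patterns.

          answer_patterns: regular expressions for one question (hypothetical input).
          supporting_docs: IDs of documents judged to support the answer (hypothetical input).
          """
          if not any(re.search(p, candidate) for p in answer_patterns):
              return False
          if not strict:
              return True          # lenient: any pattern match in the collection counts
          return doc_id in supporting_docs  # strict: match must come from a supporting document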

  • Evaluation Dataset (3 of 3)
    We estimate the actual redundancy of this text collection and question set to be 3.68, based on the average number of texts per question judged by human assessors to support an answer.
    Some supporting documents may contain more than one occurrence of the answer.
    Not every document that supports an answer is likely to have been found by the assessors.

  • Okapi
    Representative of the state of the art in ranked retrieval.
    Passage construction is done at search time, not at index time.
    Passages are based on paragraph boundaries and are about 4 sentences long.

  • Lucene
    Open-source IR engine: Boolean queries, ranked retrieval, standard tf.idf weighting, cosine similarity measure.
    The corpus is split into passages at paragraph boundaries; average paragraph length is about 1.5 sentences.
    Stopwords are removed and terms are stemmed with the Porter stemmer.
    Queries consist of all the question words (a preprocessing sketch follows below).
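
    A rough sketch of this kind of query preprocessing (lowercase, drop stopwords, Porter-stem the rest); the tiny stoplist here is illustrative, not the one used by the authors:

      from nltk.stem import PorterStemmer

      # Illustrative stoplist only.
      STOPLIST = {"what", "is", "the", "a", "an", "of", "to", "it", "how", "on", "from", "in", "for"}
      stemmer = PorterStemmer()

      def query_terms(question):
          """All non-stopword question words, Porter-stemmed."""
          words = question.lower().replace("?", "").split()
          return [stemmer.stem(w) for w in words if w not in STOPLIST]

      print(query_terms("How far is it from Earth to Mars?"))  # e.g. ['far', 'earth', 'mar']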

  • Z-PRISE
    Vector space retrieval system freely available from the National Institute of Standards and Technology (NIST).
    Documents are not split into passages; the average document length is 24 sentences.
    Any given rank in Z-PRISE therefore corresponds to a greater amount of text.
    Coverage and redundancy should be better than for Okapi and Lucene, but the larger volume of text brings a risk of lower downstream performance.

  • Boolean retrieval for QA
    With ranked retrieval we can simply take the words of the question as a query and obtain ranked answer-candidate passages.
    With Boolean retrieval, if the terms do not appear together in any passage of the entire collection, we try to weaken the query.
    Weakening can mean deleting terms or generalizing them, and can be done dynamically.

  • MURAX
    Kupiec's MURAX. Knowledge base: Grolier's on-line encyclopedia.
    Analyses the question to locate noun phrases and main verbs, and forms a query from them.
    As passages are returned, new queries are created to reduce the number of hits (narrowing) or increase it (broadening).
    Passages are ranked by overlap with the initial question.

  • Falcon
    Uses the SMART retrieval engine.
    The initial query is formulated using keywords from the question.
    A query term may be joined with alternatives: w1 -> w1 OR w2 OR w3.
    Morphological alternations (invent, inventor, invented), lexical alternations (assassin, killer), semantic alternations (prefer, like).

  • Sheffield
    In-house Boolean search engine, MadCow.
    A window size is used both for matching and for the passages returned.
    Query formulation is based on name expressions, e.g. Bill Clinton: (Bill & Clinton) | (President & Clinton).
    If the query fails or returns too many passages, it is reformulated, e.g. by extending an overly weak name condition or by substituting terms in place of a name condition.

  • Experiments (1 of 8)
    Goal: understanding query formulation.
    Dimensions explored: question analysis, term expansion, query broadening, passage and matching-window size, ranking.

  • Experiments (2 of 8)
    Minimal strategy (lower bound).
    Simplest approach, AllTerms: use a conjunction of the question terms to formulate the query.
    Example: How far is it from Earth to Mars? -> (Mars & Earth)
    Q and P represent the sets of non-stoplist terms in the question and in a passage; a passage matches when Q is a subset of P.
    178 of 362 questions return a non-empty result (see the sketch below).
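
    A minimal sketch of the AllTerms condition (every non-stoplist question term must appear in the passage); the example passages are made up for illustration:

      def allterms_match(question_terms, passage_terms):
          """Conjunctive match: Q must be a subset of P."""
          return set(question_terms) <= set(passage_terms)

      def retrieve(question_terms, passages):
          """Return the passages containing all of the question terms."""
          return [p for p in passages if allterms_match(question_terms, p.split())]

      passages = [
          "Mars is on average about 225 million km from Earth",
          "Earth has one natural satellite",
      ]
      print(retrieve(["Mars", "Earth"], passages))  # only the first passage matches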

  • Experiments (3 of 8)
    Simple term expansion.
    The WordNet configuration explores the use of synonymy expansion.
    When the AllTerms query has no match, each term is replaced by a disjunction of the term and its WordNet synonyms.
    202 of 362 questions have at least one match (see the sketch below).
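
    A rough sketch of this kind of synonymy expansion using NLTK's WordNet interface (not the authors' implementation; requires the WordNet data to be downloaded):

      from nltk.corpus import wordnet as wn  # needs: nltk.download('wordnet')

      def synonym_disjunction(term):
          """Expand a query term into a disjunction of the term and its WordNet synonyms."""
          alternatives = {term.lower()}
          for synset in wn.synsets(term):
              for lemma in synset.lemma_names():
                  alternatives.add(lemma.replace("_", " ").lower())
          return " | ".join(sorted(alternatives))

      print(synonym_disjunction("assassin"))  # e.g. "assassin | assassinator | ..."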

  • Experiments (4 of 8)
    The MorphVar configuration explores the use of morphological variants of query terms.
    Variants are corpus words that return the same stem string when the Porter stemmer is applied.
    When the AllTerms query has no match, each term is replaced by a disjunction of its morphological variants.
    203 of 362 questions have at least one match.

  • Experiments (5 of 8)
    Document frequency broadening.
    The DropBig configuration discards from the initial query the question term with the highest document frequency (273 of 362 questions match).
    The DropSmall configuration discards the question term with the lowest document frequency (288 of 362).
    Iterative deletion keeps dropping the highest- or lowest-frequency term until at least one passage matches: DropBig -> BigIte, DropSmall -> SmallIte (see the sketch below).
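
    A small sketch of iterative deletion by document frequency, assuming a hypothetical df mapping from terms to document frequencies and a hypothetical search(terms) function that evaluates the conjunction of the given terms:

      def broaden_by_df(terms, df, search, drop_biggest=True):
          """Drop the highest-df (BigIte) or lowest-df (SmallIte) term, one at a time,
          until the conjunctive query matches at least one passage or no terms remain."""
          terms = list(terms)
          by_df = lambda t: df.get(t, 0)
          while terms:
              hits = search(terms)  # conjunction of the remaining terms
              if hits:
                  return terms, hits
              victim = max(terms, key=by_df) if drop_biggest else min(terms, key=by_df)
              terms.remove(victim)
          return terms, []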

  • Experiments (6 of 8)
    Structure analysis: StrIte.
    Distinguishes proper names and quoted expressions from the other terms in a question; POS tagging is used to identify proper nouns.
    Example: What is Richie's surname on "Happy Days"?
    Name term: Richie. Quote term: Happy Days. Common terms: the remaining words (what, is, 's, surname, on).

  • Experiments (7 of 8)
    AllTerms -> StrIte: iteratively drops terms until at least one matching sentence is returned.
    Drop order: common terms first, then name terms, then quote terms (see the sketch below).
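
    A rough sketch of the StrIte drop order, assuming the question terms are already classified as common, name, or quote terms and a hypothetical search(terms) function:

      DROP_ORDER = ("common", "name", "quote")  # common terms are sacrificed first, quote terms last

      def strite(terms, search):
          """terms: list of (term, kind) pairs, kind in {'common', 'name', 'quote'}.
          Drop terms by kind until the conjunctive query matches at least one sentence."""
          terms = list(terms)
          while terms:
              hits = search([t for t, _ in terms])
              if hits:
                  return terms, hits
              for kind in DROP_ORDER:  # delete one term of the least valued kind still present
                  remaining = [pair for pair in terms if pair[1] == kind]
                  if remaining:
                      terms.remove(remaining[0])
                      break
          return terms, []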

  • Experiments (8 of 8)
    StrIte -> StrIteMorph -> StrIteMorph20.
    StrIteMorph: as StrIte, but each term is expanded with its morphological variants.
    StrIteMorph20: as StrIteMorph, but broadening continues until at least 20 sentences per question are retrieved.
    In these 3 configurations term weights w(t) are used: 1/6 for common terms, 2/6 for name terms, 3/6 for quote terms (a ranking sketch follows below).
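
    A minimal sketch of ranking retrieved sentences by the summed weights of the question terms they contain, using the w(t) values above; how exactly the weights are applied is not spelled out on the slide, so this is an assumption:

      from fractions import Fraction

      W = {"common": Fraction(1, 6), "name": Fraction(2, 6), "quote": Fraction(3, 6)}

      def score(sentence_terms, question_terms):
          """question_terms: list of (term, kind) pairs; sum w(t) over terms present in the sentence."""
          present = set(sentence_terms)
          return sum(W[kind] for term, kind in question_terms if term in present)

      def rank(sentences, question_terms):
          """Sort sentences by descending weighted overlap with the question."""
          return sorted(sentences, key=lambda s: score(s.split(), question_terms), reverse=True)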

  • Result (1 of 2)
    Mean number of sentences an answer extraction system would have to process at rank k.

  • Result (2 of 2)

  • Discussion (1 of 2)
    At rank 200, StrIteMorph20 achieves 62.15% coverage, compared to 72.9% for Lucene, 78.2% for Okapi, and 80.4% for Z-PRISE.
    At rank 200, StrIteMorph20 returns on average around 137 sentences, Lucene around 300, Okapi around 800, and Z-PRISE around 4600.

  • Discussion (2 of 2)
    The smaller text volume may help a downstream answer extraction component avoid the distraction of a larger volume of text.
    Synonym expansion (WordNet) offers negligible advantage.
    Expanding all terms with morphological variants (MorphVar) does not offer a major improvement either.

  • Future work
    The post-retrieval ranking of results needs to be explored in more detail, and other ranking methods should be explored.
    Also: the most effective window size, query refinement, and term expansion methods.
