
  • Exploring the Performance of Boolean Retrieval Strategies for Open Domain Question Answering
    IR4QA: Information Retrieval for Question Answering, SIGIR 2004 Workshop
    Horacio Saggion, Robert Gaizauskas, Mark Hepple, Ian Roberts, Mark A. Greenwood (University of Sheffield, UK)

  • Abstract
    Exploring and evaluating Boolean retrieval strategies for QA.
    Evaluation metrics: coverage and redundancy.
    A series of possible Boolean retrieval strategies for use in QA; evaluate their performance.
    Understanding which query formulations are better for QA.

  • Introduction (1 of 2)
    Open domain question answering.
    A QA system's performance is bounded by its IR component.
    Pipeline: Question -> IR -> Answer Extraction -> Answer.

  • Introduction (2 of 2)
    Present results for off-the-shelf ranked retrieval engines (baseline: no answer extraction).
    Describe previous work that has used a Boolean search strategy in QA.
    Then: experiments, results, discussion.

  • Evaluation Measures (1 of 4)
    Coverage: the proportion of the question set for which a correct answer can be found within the top n passages retrieved for each question.
    Answer redundancy: the average number of passages within the top n ranks retrieved which contain a correct answer, per question.
    In other words, how many chances per question, on average, an answer extraction component has to find an answer.

  • Evaluation Measures (2 of 4)
    Q: the question set.
    D: the document collection (corpus).
    n: the top n ranks considered.
    A_{D,q}: the subset of D which contains correct answers for q ∈ Q.
    R^S_{D,q,n}: the n top-ranked documents (or passages) in D retrieved by a retrieval system S given question q.
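
    Given these definitions, the two measures can be written out as follows (a reconstruction from the notation on this slide, not copied from the paper):

      \mathrm{coverage}^{S}(Q, D, n) =
        \frac{\big|\{\, q \in Q \mid R^{S}_{D,q,n} \cap A_{D,q} \neq \emptyset \,\}\big|}{|Q|}

      \mathrm{redundancy}^{S}(Q, D, n) =
        \frac{\sum_{q \in Q} \big| R^{S}_{D,q,n} \cap A_{D,q} \big|}{|Q|}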

  • Evaluation Measures (3 of 4)
    [Figure: the corpus D, divided into documents with an answer (A_{D,q}) and documents without an answer, with the retrieved set R^S_{D,q,n} overlapping both.]

  • Evaluation Measures (4 of 4)
    The actual redundancy is the upper bound on the answer redundancy achievable by any QA system.
    Comparing answer redundancy with actual redundancy captures the same information that recall traditionally supplies.

  • Evaluation Dataset (1 of 3)
    Text collection: the AQUAINT corpus, around 1,000,000 newswire documents from 1998-2000, 3.2 GB.
    TREC 2003 question set: 500 questions; we use only the 413 factoid questions, not the list or definition questions.
    51 of these questions were judged by human assessors to have no answer in the collection.
    Excluding them leaves a question set of 362 questions.
    NIST provides regular-expression answer patterns for each question, which match strings that contain the answer.

  • Evaluation Dataset (2 of 3)
    There are two criteria for correctness.
    Lenient: any string drawn from the test collection which matches an answer pattern for the question is counted as correct.
    Strict: a string matching an answer pattern must in addition be drawn from a text which has been judged by a human assessor to support the answer.
    (a small checking sketch follows below)
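
    A minimal sketch of how the two criteria might be checked, assuming a hypothetical list of answer-pattern regexes per question and a hypothetical set of document IDs judged by assessors to support the answer:

      import re

      def is_correct(candidate, doc_id, answer_patterns, supporting_docs, strict=False):
          """Check a candidate answer string against NIST-style answer patterns.

          answer_patterns: regular expressions for one question (hypothetical input).
          supporting_docs: IDs of documents judged to support the answer (hypothetical input).
          """
          if not any(re.search(p, candidate) for p in answer_patterns):
              return False
          if not strict:
              return True          # lenient: any pattern match in the collection counts
          return doc_id in supporting_docs  # strict: match must come from a supporting document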

  • Evaluation Dataset (3 of 3)
    We estimate the actual redundancy of this text collection and question set to be 3.68, based on the average number of texts per question judged by human assessors to support an answer.
    Some supporting documents may contain more than one occurrence of the answer.
    Not every document that supports an answer is likely to have been found by the assessors.

  • Okapi
    Representative of the state of the art in ranked retrieval.
    Passage construction is done at search time, not at index time.
    Passages are based on paragraph boundaries and are about 4 sentences long.

  • Lucene
    Open-source IR engine: Boolean queries, ranked retrieval, standard tf.idf weighting, cosine similarity measure.
    The corpus is split into passages at paragraph boundaries; average paragraph length is about 1.5 sentences.
    Stopwords are removed and terms are stemmed with the Porter stemmer.
    Queries consist of all the question words (a preprocessing sketch follows below).
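
    A rough sketch of this kind of query preprocessing (lowercase, drop stopwords, Porter-stem the rest); the tiny stoplist here is illustrative, not the one used by the authors:

      from nltk.stem import PorterStemmer

      # Illustrative stoplist only.
      STOPLIST = {"what", "is", "the", "a", "an", "of", "to", "it", "how", "on", "from", "in", "for"}
      stemmer = PorterStemmer()

      def query_terms(question):
          """All non-stopword question words, Porter-stemmed."""
          words = question.lower().replace("?", "").split()
          return [stemmer.stem(w) for w in words if w not in STOPLIST]

      print(query_terms("How far is it from Earth to Mars?"))  # e.g. ['far', 'earth', 'mar']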

  • Z-PRISE
    Vector space retrieval system freely available from the National Institute of Standards and Technology (NIST).
    Documents are not split into passages; the average document length is 24 sentences.
    Any given rank in Z-PRISE therefore corresponds to a greater amount of text.
    Coverage and redundancy should be better than for Okapi and Lucene, but the larger volume of text brings a risk of lower downstream performance.

  • Boolean retrieval for QA
    With ranked retrieval we can simply take the words of the question as a query and obtain ranked answer-candidate passages.
    With Boolean retrieval, if the terms do not appear together in any passage of the entire collection, we try to weaken the query.
    Weakening can mean deleting terms or generalizing them, and can be done dynamically.

  • MURAX
    Kupiec's MURAX. Knowledge base: Grolier's on-line encyclopedia.
    Analyses the question to locate noun phrases and main verbs, and forms a query from them.
    As passages are returned, new queries are created to reduce the number of hits (narrowing) or increase it (broadening).
    Passages are ranked by overlap with the initial question.

  • Falcon
    Uses the SMART retrieval engine.
    The initial query is formulated using keywords from the question.
    A query term may be joined with alternatives: w1 -> w1 OR w2 OR w3.
    Morphological alternations (invent, inventor, invented), lexical alternations (assassin, killer), semantic alternations (prefer, like).

  • Sheffield
    In-house Boolean search engine, MadCow.
    A window size is used both for matching and for the passages returned.
    Query formulation is based on name expressions, e.g. Bill Clinton: (Bill & Clinton) | (President & Clinton).
    If the query fails or returns too many passages, it is reformulated, e.g. by extending an overly weak name condition or by substituting terms in place of a name condition.

  • Experiments (1 of 8)
    Goal: understanding query formulation.
    Dimensions explored: question analysis, term expansion, query broadening, passage and matching-window size, ranking.

  • Experiments (2 of 8)
    Minimal strategy (lower bound).
    Simplest approach, AllTerms: use a conjunction of the question terms to formulate the query.
    Example: How far is it from Earth to Mars? -> (Mars & Earth)
    Q and P represent the sets of non-stoplist terms in the question and in a passage; a passage matches when Q is a subset of P.
    178 of 362 questions return a non-empty result (see the sketch below).
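
    A minimal sketch of the AllTerms condition (every non-stoplist question term must appear in the passage); the example passages are made up for illustration:

      def allterms_match(question_terms, passage_terms):
          """Conjunctive match: Q must be a subset of P."""
          return set(question_terms) <= set(passage_terms)

      def retrieve(question_terms, passages):
          """Return the passages containing all of the question terms."""
          return [p for p in passages if allterms_match(question_terms, p.split())]

      passages = [
          "Mars is on average about 225 million km from Earth",
          "Earth has one natural satellite",
      ]
      print(retrieve(["Mars", "Earth"], passages))  # only the first passage matches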

  • Experiments (3 of 8)
    Simple term expansion.
    The WordNet configuration explores the use of synonymy expansion.
    When the AllTerms query has no match, each term is replaced by a disjunction of the term and its WordNet synonyms.
    202 of 362 questions have at least one match (see the sketch below).
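
    A rough sketch of this kind of synonymy expansion using NLTK's WordNet interface (not the authors' implementation; requires the WordNet data to be downloaded):

      from nltk.corpus import wordnet as wn  # needs: nltk.download('wordnet')

      def synonym_disjunction(term):
          """Expand a query term into a disjunction of the term and its WordNet synonyms."""
          alternatives = {term.lower()}
          for synset in wn.synsets(term):
              for lemma in synset.lemma_names():
                  alternatives.add(lemma.replace("_", " ").lower())
          return " | ".join(sorted(alternatives))

      print(synonym_disjunction("assassin"))  # e.g. "assassin | assassinator | ..."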

  • Experiments (4 of 8)
    The MorphVar configuration explores the use of morphological variants of query terms.
    Variants are corpus words that return the same stem string when the Porter stemmer is applied.
    When the AllTerms query has no match, each term is replaced by a disjunction of its morphological variants.
    203 of 362 questions have at least one match.

  • Experiments (5 of 8)
    Document frequency broadening.
    The DropBig configuration discards from the initial query the question term with the highest document frequency (273 of 362 questions match).
    The DropSmall configuration discards the question term with the lowest document frequency (288 of 362).
    Iterative deletion keeps dropping the highest- or lowest-frequency term until at least one passage matches: DropBig -> BigIte, DropSmall -> SmallIte (see the sketch below).
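
    A small sketch of iterative deletion by document frequency, assuming a hypothetical df mapping from terms to document frequencies and a hypothetical search(terms) function that evaluates the conjunction of the given terms:

      def broaden_by_df(terms, df, search, drop_biggest=True):
          """Drop the highest-df (BigIte) or lowest-df (SmallIte) term, one at a time,
          until the conjunctive query matches at least one passage or no terms remain."""
          terms = list(terms)
          by_df = lambda t: df.get(t, 0)
          while terms:
              hits = search(terms)  # conjunction of the remaining terms
              if hits:
                  return terms, hits
              victim = max(terms, key=by_df) if drop_biggest else min(terms, key=by_df)
              terms.remove(victim)
          return terms, []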

  • Experiments (6 of 8)
    Structure analysis: StrIte.
    Distinguishes proper names and quoted expressions from the other terms in a question; POS tagging is used to identify proper nouns.
    Example: What is Richie's surname on "Happy Days"?
    Name term: Richie. Quote term: Happy Days. Common terms: the remaining words (what, is, 's, surname, on).

  • Experiments (7 of 8)
    AllTerms -> StrIte: iteratively drops terms until at least one matching sentence is returned.
    Drop order: common terms first, then name terms, then quote terms (see the sketch below).
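
    A rough sketch of the StrIte drop order, assuming the question terms are already classified as common, name, or quote terms and a hypothetical search(terms) function:

      DROP_ORDER = ("common", "name", "quote")  # common terms are sacrificed first, quote terms last

      def strite(terms, search):
          """terms: list of (term, kind) pairs, kind in {'common', 'name', 'quote'}.
          Drop terms by kind until the conjunctive query matches at least one sentence."""
          terms = list(terms)
          while terms:
              hits = search([t for t, _ in terms])
              if hits:
                  return terms, hits
              for kind in DROP_ORDER:  # delete one term of the least valued kind still present
                  remaining = [pair for pair in terms if pair[1] == kind]
                  if remaining:
                      terms.remove(remaining[0])
                      break
          return terms, []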

  • Experiments (8 of 8)
    StrIte -> StrIteMorph -> StrIteMorph20.
    StrIteMorph: as StrIte, but each term is expanded with its morphological variants.
    StrIteMorph20: as StrIteMorph, but broadening continues until at least 20 sentences per question are retrieved.
    In these 3 configurations term weights w(t) are used: 1/6 for common terms, 2/6 for name terms, 3/6 for quote terms (a ranking sketch follows below).
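
    A minimal sketch of ranking retrieved sentences by the summed weights of the question terms they contain, using the w(t) values above; how exactly the weights are applied is not spelled out on the slide, so this is an assumption:

      from fractions import Fraction

      W = {"common": Fraction(1, 6), "name": Fraction(2, 6), "quote": Fraction(3, 6)}

      def score(sentence_terms, question_terms):
          """question_terms: list of (term, kind) pairs; sum w(t) over terms present in the sentence."""
          present = set(sentence_terms)
          return sum(W[kind] for term, kind in question_terms if term in present)

      def rank(sentences, question_terms):
          """Sort sentences by descending weighted overlap with the question."""
          return sorted(sentences, key=lambda s: score(s.split(), question_terms), reverse=True)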

  • Result (1 of 2)
    Mean number of sentences an answer extraction system would have to process at rank k.

  • Result (2 of 2)

  • Discussion (1 of 2)
    At rank 200, StrIteMorph20 achieves 62.15% coverage, compared to 72.9% for Lucene, 78.2% for Okapi, and 80.4% for Z-PRISE.
    At rank 200, StrIteMorph20 returns on average around 137 sentences, Lucene around 300, Okapi around 800, and Z-PRISE around 4600.

  • Discussion (2 of 2)
    The smaller text volume may help a downstream answer extraction component avoid the distraction of a larger volume of text.
    Synonym expansion (WordNet) offers negligible advantage.
    Expanding all terms with morphological variants (MorphVar) does not offer a major improvement either.

  • Future work
    The post-retrieval ranking of results needs to be explored in more detail, and other ranking methods should be explored.
    Also: the most effective window size, query refinement, and term expansion methods.
