Special Topics on Information Retrieval
Manuel Montes y Gómez
http://ccc.inaoep.mx/~mmontesg/
University of Alabama at Birmingham, Fall 2010.
Spoken Document Retrieval
Content of the section
• Definition of the task
• Traditional architecture for SDR
• Automatic Speech Recognition
  – Basic ideas and common errors
• Approaches for SDR
  – Manipulating the ASR system
  – Recovering from recognition errors
Motivation
• Speech is the primary and most convenient means of communication between humans
• Multimedia documents are becoming increasingly popular, and finding relevant information in them is a challenging task
“The problem of audio/speech retrieval is familiar to anyone who has returned from vacation to find an answering machine full of messages. If you are not fortunate, you may have to listen to the entire tape to find the urgent one from your boss”
Spoken document retrieval (SDR)
• SDR refers to the task of finding segments of recorded speech that are relevant to a user's information need.
  – A large amount of the information produced today is in spoken form: TV and radio broadcasts, recordings of meetings, lectures and telephone conversations.
  – There are currently no universal tools for speech retrieval.
Ideas for carrying out this task?
Ideal architecture
• An ideal system would simply concatenate an automatic speech recognition (ASR) system with a standard text indexing and retrieval system.
[Diagram: Audio Collection → Speech Recognizer → recognized text → Text-based IR system (+ Query) → Results]
Problems with this architecture?
Missing slide… about ASR
• General architecture
• Dimensions of the problem
Basic components of the ASR system
• Feature extraction
  – Transforms the input waveform into a sequence of acoustic feature vectors; each vector represents the information in a small time window of the signal.
• Decoder
  – Combines information from the acoustic and language models and finds the word sequence with the highest probability given the observed speech features.
  – Acoustic models: describe how phonemes in speech are realized as feature vectors.
  – Lexicon: a list of words with a pronunciation for each word, expressed as a phone sequence.
  – Language model: estimates probabilities of word sequences.
More about this topic in: Spring 2011; CS 462/662/762 Natural Language Processing; Dr. Thamar Solorio.
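The decoder described above can be pictured as an argmax over candidate word sequences, combining acoustic and language-model scores in log space. A minimal sketch with hypothetical scores (both tables are made up for illustration; real decoders search a lattice rather than a fixed candidate list):

```python
# Hypothetical log-scores for two candidate transcriptions of one utterance.
acoustic_logp = {                     # log P(observations | words)
    "recognize speech": -4.0,
    "wreck a nice beach": -3.5,       # acoustically slightly better
}
lm_logp = {                           # log P(words), from the language model
    "recognize speech": -2.0,         # far more probable word sequence
    "wreck a nice beach": -6.0,
}

def decode(candidates):
    # Pick the word sequence maximizing acoustic score + LM score.
    return max(candidates, key=lambda w: acoustic_logp[w] + lm_logp[w])
```

Here the language model overrides the slightly better acoustic score, which is exactly how decoders recover plausible word sequences from ambiguous audio.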
Common errors in the recognition stage
• Recognition errors are deletions, insertions and substitutions of legitimate words.
  – Errors at the word level:
    • "governors" → "governess"
    • "kilos" → "killers"
    • "Sinn Fein" → "shame fame"
  – Errors at word boundaries:
    • "Frattini" → "Freeh teeny"
    • "leap shark" → "Liebschard"
Transcription errors and IR
• The main problem with the traditional SDR architecture is the accuracy of the recognition output.
  – Around 50% word accuracy on real-world tasks.
• Example:
  – What will happen with a query about "Saddam Hussein"?
  – How to handle the errors or incomplete output provided by ASR systems?

Audio input: "the efforts by certain states to circumvent UN sanctions against the Saddam Hussein regime in Iraq"
Text transcription: "the efforts by certain States to circumvent U. N. sanctions against the sit down the same regime in Iraq"
SDR in non-spontaneous speech: conclusions from the TREC 2000 edition
Using a corpus of 550 hours of Broadcast News:
• Spoken news retrieval systems achieved almost the same performance as traditional IR systems.
  – Even with error rates of around 40%, the effectiveness of an IR system falls by less than 10%.
  – Long queries are better than short queries.
  – Deletions and substitutions are more harmful than insertions, especially for long queries.
New challenges
• Spoken questions of short duration
• Message-length documents
  – For example, voice-mail messages, announcements, and so on.
• Other types of spoken documents such as dialogues, meetings, lectures, classes, etc.
  – Contexts where the word error rate is well over 50%.
• Applications such as:
  – Question answering
  – Summarizing speech
Main ideas for SDR
• Manipulating the ASR system (white box)
  – Retrieval at the phonetic level
  – Adding alternative recognition results
    • N most likely paths in the lattice
    • Using the complete word lattice
• Recovering from recognition errors (black box)
  – Query and/or document expansion
  – Using multiple recognizers
  – Applying a phonetic codification
Alternative recognition results
• Speech recognizers aim to produce a transcription with as few errors as possible.
  – It is possible that the correct word appears among the candidates the recognizer considers, but gets mistakenly pruned away.
• Retrieval performance can be improved by adding several candidates to the transcription.
Query/document expansion (1)
• The effect of out-of-vocabulary query words and other recognition errors can be reduced by adding to the query extra terms that have similar meaning or that are otherwise likely to appear in the same documents as the query terms.
  – From the top-ranked documents for the given query
  – Using associations extracted from the whole collection
• It is common to use a parallel written document set.
  – This collection must be thematically related.
Query/document expansion (2)
• Another approach to the OOV problem is to expand word queries into in-vocabulary phrases according to intrinsic acoustic confusability and language model scores.
  – For example, taliban may be expanded to tell a band.
• The aim is to mimic the mistakes the speech recognizer makes when transcribing the audio.
• This approach is dependent on the ASR system.
Using multiple recognizers
• Different independently developed recognizers tend to make different kinds of errors, and combining their outputs may allow some errors to be recovered.
• Combination of scores can be done by any traditional information fusion method.
  – Good results have been obtained with simple linear combinations.
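The simple linear combination mentioned above can be sketched in a few lines, fusing per-document retrieval scores obtained from two recognizers' transcriptions (document names and scores are illustrative):

```python
def fuse(scores1, scores2, w=0.5):
    """Weighted linear combination of two per-document score dictionaries."""
    docs = set(scores1) | set(scores2)
    return {d: w * scores1.get(d, 0.0) + (1 - w) * scores2.get(d, 0.0)
            for d in docs}

# Hypothetical scores from IR runs over two recognizers' transcripts.
run_a = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.1}
run_b = {"doc1": 0.2, "doc2": 0.8, "doc3": 0.3}
fused = fuse(run_a, run_b)
ranking = sorted(fused, key=fused.get, reverse=True)
```

A document missing from one run simply contributes a zero score from that run, so errors of a single recognizer are softened rather than fatal.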
Using a phonetic codification
• Phonetic codifications allow characterizing words with similar pronunciations through the same code.
• Example of a Soundex codification:
  – Unix Sun Workstation → (U52000 S30000 W62300)
  – Unique some workstation → (U52000 S30000 W62300)
• The idea of this approach is to build an enriched representation of transcriptions by combining words and phonetic codes.
The algorithm at a glance
1. Compute the phonetic codification for each transcription using a given algorithm.
2. Combine transcriptions and their phonetic codifications to form an enriched document representation.
3. Remove unimportant tokens from the new document representation.
   – Stop words and the most frequent codes.
4. Create a combined index using words and codes.
   – Incoming queries need to be represented in the same way.
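Step 1 can be sketched with the classic Soundex algorithm. The slides use a 6-character variant; the version below generalizes the standard 4-character code to a configurable length and reproduces, e.g., wallenberg → W45162 from the worked example, though the exact variant behind the slides may differ in minor details:

```python
def soundex(word, length=4):
    """Classic Soundex, generalized to an arbitrary code length."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    digits, prev = [], codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":             # h and w do not reset the previous code
            continue
        code = codes.get(ch, "")
        if code and code != prev:  # drop vowels, collapse repeated codes
            digits.append(code)
        prev = code
    return (word[0].upper() + "".join(digits) + "0" * length)[:length]
```

With this scheme "Robert" and "Rupert" both map to R163, which is exactly the kind of collision the enriched representation exploits to survive recognition errors.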
Example of the representation
Automatic transcription:
…just your early discussions was roll wallenberg uh any recollection of of uh where he came from…
Phonetic codification:
... J23000 Y60000 E64000 D22520 W20000 R40000 W45162 U00000 A50000 R24235 O10000 O10000 U00000 W60000 H00000 C50000 F65000 ...
Enriched representation:
{just, early, discussions, roll, wallenberg, recollection, came, E64000, D22520, R40000, W45162, R24235}
• Query: Actions of Raoul Wallenberg → {actions, raoul, wallenberg, A23520, R40000, W45162}
Geographic Information Retrieval
Content of the section
• Definition of the task
  – The need for GIR
  – Kinds of geographical queries
• Main challenges of GIR
  – Toponym identification and disambiguation
  – Indexing for GIR
  – Measuring document similarities
• Re-ranking of retrieval results
The need for GIR
Geographical information is recorded in a wide variety of media and document types.
• Information technology for accessing geographical information has focused on the combination of digital maps and databases → GIS.
• Systems to retrieve geographically specific information from the relatively unstructured documents that compose the Web → GIR.
The size of the need
• It is estimated that one fifth of the queries submitted to search engines have geographic meaning.
  – Among them, eighty percent can be associated with a geographic place.
Definition of the task
• Geographical Information Retrieval (GIR) considers the search for documents based not only on conceptual keywords, but also on spatial information.
• A geographic query is defined by a tuple: <what, relation, where>
  – Example: "Whisky making in the Scottish Islands"
• <what> represents the thematic part.
• <where> specifies the geographical areas of interest.
• <relation> specifies the "spatial relation" that connects the what and the where.
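A naive, hypothetical splitter can illustrate the <what, relation, where> tuple; it merely looks for the first standalone spatial-relation word, whereas real GIR systems use much richer query parsing:

```python
import re

# Illustrative list of spatial-relation trigger words.
RELATIONS = ("near", "around", "in", "at")

def parse_geo_query(query):
    """Split a query into a (what, relation, where) tuple, naively."""
    pattern = r"\b(" + "|".join(RELATIONS) + r")\b"
    m = re.search(pattern, query, flags=re.IGNORECASE)
    if not m:
        return (query, None, None)   # no spatial part detected
    return (query[:m.start()].strip(), m.group(1).lower(),
            query[m.end():].strip())
```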
Different kinds of queries
• With concrete locations:
  – "ETA in France" (GC049)
• With locations and simple rules of relevant locations:
  – "Car bombings near Madrid" (GC030)
• With locations and complex rules of relevant locations:
  – "Automotive industry around the Sea of Japan" (GC036)
• With very general locations that are not necessarily in a gazetteer:
  – "Snowstorms in North America" (GC028)
• With quasi-locations (e.g. political) that are not found in a gazetteer:
  – "Malaria in the tropics" (GC034)
• Describing characteristics of the geographical location:
  – "Cities near active volcanoes" (GC040)
The problem
• In classic IR, retrieved documents are ranked by their similarity to the text of the query.
• In a search engine with geographic capabilities, the semantics of geographic terms should be considered as one of the ranking criteria.

The problem of weighting the geographic importance of a document can be reduced to computing the similarity between two geographic locations, one associated with the query and the other with the document.
Challenges of GIR
• Detecting geographical references in the form of place names within text documents and in users' queries
• Disambiguating place names to determine which particular instance of a name is intended
• Geometric interpretation of the meaning of vague place names ("Midlands") and spatial relations ("near")
• Indexing documents with respect to their geographic context as well as their non-spatial thematic content
• Ranking the relevance of documents with respect to geography as well as theme
• Developing effective user interfaces that help users to find what they want
Detecting geographic references
• The process of geo-parsing is concerned with analyzing text to identify the presence of place names → an extension of Named Entity Recognition.
• The problem is that place names (or toponyms) can refer to places on Earth, but they also occur frequently within the names of organizations and as parts of people's names.
  – Washington: president or place? (PER vs. LOC)
  – Mexico: country or football team? (LOC vs. ORG)
Two main approaches
• Knowledge-based
  – Using an existing gazetteer
    • A list containing information on geographical references (e.g. name, name variations, coordinates, class, size, additional information).
• Data-driven or supervised
  – Using statistical or machine learning methods
    • Typical features: capitalization, numeric symbols, punctuation marks, position in the sentence, and the words themselves.
Advantages and disadvantages?
Disambiguating place names
• Once it has been established that a place name is being used in a geographic sense, the problem remains of determining uniquely the place to which the name refers → toponym resolution.
  – Paris is a place name, but it may refer to the capital of France, or to one of the more than a dozen places named Paris in the US, Canada and Gambia.
The ambiguity of a toponym depends on the world knowledge that a system has.
Human Errors in TR (taken from a presentation by Davide Buscaldi, UPV, Spain)
Selected Toponym Resources (taken from a presentation by Davide Buscaldi, UPV, Spain)
• Gazetteers
  – Geonames: http://www.geonames.org
  – Wikipedia-World: http://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Georeferenzierung/Wikipedia-World/en
• Structured resources
  – Yahoo! GeoPlanet: http://developer.yahoo.com/geo/geoplanet/
  – Getty Thesaurus of Geographical Names: http://www.getty.edu/research/conducting_research/vocabularies/tgn/
  – (Geo)WordNet: http://users.dsic.upv.es/grupos/nle/resources/geo-wn/download.html
Methods for Toponym Resolution
• Three broad categories:
  – Map-based
    • Need geographical coordinates
  – Knowledge-based
    • Need hierarchical resources
  – Data-driven or supervised
    • Need a large enough set of labeled data
    • Many names occur only once (it is impossible to estimate their probabilities)
A map-based method: Smith 2001
• The right referent is the one with minimum average distance from the context locations.
• Reported precision: 74% to 93%, depending on the test collection.
“One hundred years ago there existed in England the Association for the Promotion of the Unity of Christendom. ... A Birmingham newspaper printed in a column for children an article entitled “The True Story of Guy Fawkes,” ... An Anglican clergyman in Oxford sadly but frankly acknowledged to me that this is true. ... A notable example of this was the discussion of Christian unity by the Catholic Archbishop of Liverpool, …”
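Smith's minimum-average-distance heuristic can be sketched as follows. The great-circle (haversine) distance is standard; the candidate referents and context coordinates are rough values chosen to mirror the Birmingham example in the passage above:

```python
import math

def haversine(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

def resolve(candidates, context):
    """Pick the referent with minimum average distance to context locations."""
    return min(candidates, key=lambda name: sum(
        haversine(candidates[name], c) for c in context) / len(context))

# Approximate coordinates for the ambiguous toponym and its context.
candidates = {"Birmingham, England": (52.48, -1.90),
              "Birmingham, Alabama": (33.52, -86.80)}
context = [(52.36, -1.17),    # England (rough centroid)
           (51.75, -1.26),    # Oxford
           (53.41, -2.99)]    # Liverpool
```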
Conceptual Density TR: Buscaldi 2008
• Adaptation of a WSD method based on Conceptual Density, computed over hierarchies of hypernyms, to hierarchies of holonyms.
• Given an ambiguous place name, different subhierarchies are obtained from WordNet; the sense related to the most densely marked subhierarchy is selected.
[Diagram: two holonym subhierarchies for the passage above — World → UK → England → {Liverpool, Oxford, Birmingham(1)}; World → USA → {Alabama → Birmingham(2), Mississippi → Oxford}]
Spatial and textual indexing
• Once the toponyms are identified, it is necessary to take advantage of them for indexing.
• The main approach consists of a combination of:
  – A textual index, using all words except toponyms
  – A geographic index, considering only the toponyms
• Toponym ambiguity may or may not be resolved → implications of this?
• The geo-index may be enriched using synonyms and holonyms → how to do it? Other related words?
Other indexing alternatives?
Geographical relevance ranking
Retrieval of relevant documents requires matching the query specification to the characteristics of the indexed documents.
• In Geo-IR there is a need to match the geographical component of the query with the geographical context of documents.
• Traditionally, two scores are used, for thematic and geographic relevance, and they are combined to find an overall relevance.
How to evaluate these similarities? How to consider the <relation> information?
About the ranking function
• Consider the following two queries:
  – "Car bombings near Madrid"
  – "Automotive industry around the Sea of Japan"
• What happens with a document mentioning:
  – Barajas, or Leganés, or Toledo? Or talking about dynamite?
  – Toyota but not "automotive industry", or "Mikura Island"?
• How to differentiate between different relations ("in", "near", "at the north of", etc.)?
Measuring geographic similarities
• Main approach: query expansion using an external resource and traditional word-comparison of documents.
  – Add to the query some related place names:
    • Whose geographic or topological distance is less than a threshold
    • That satisfy the query relation
  – Documents may also be expanded at indexing time (synonyms and holonyms).
• Alternative approach: evaluate a geographic distance between the locations from the query and the document.
  – Using geographic distances
    • Distance between points (latitude and longitude)
    • Intersection between Minimal Bounding Rectangles
  – Using topological distance
    • Computed from a given geographic resource
Relevance feedback in Geo-IR
• Traditional IR systems are able to retrieve the majority of the relevant documents for most queries, but they have severe difficulties generating a pertinent ranking of them.
• Idea: use relevance feedback for selecting some relevant documents and then re-rank the retrieval list using this information.
Our proposed solution
• Based on a Markov random field (MRF) that aims at classifying the ranked documents as relevant or irrelevant.
• The MRF takes into account:
  – Information provided by the base retrieval system
  – Similarities among documents in the list
  – Relevance feedback information
• We reduced the problem of document re-ranking to that of minimizing an energy function that represents a trade-off between document relevance and inter-document similarity.
Proposed architecture
Definition of the MRF
• Each node represents a document from the original retrieved list.
• Each fi is a binary random variable:
  – fi = 1 indicates that the i-th document is relevant
  – fi = 0 indicates that it is irrelevant
• The task of the MRF is to find the most probable configuration F = {f1, …, fN}.
  – The configuration that minimizes a given energy function
  – An optimization technique is necessary; we used ICM.
Energy function
• Combines the following information:
  – Inter-document similarity (interaction potential)
  – Query-document similarity and rank information (observation potential)
• These two similarities are computed in a traditional way, without special treatment of the geographic information.
Observation potential
Assumption: relevant documents are very similar to the query and, at the same time, are very likely to appear in the top positions.
• Captures the affinity between the document associated with node fi and the query q.
• Incorporates information from the initial retrieval system.
  – Uses the position of documents in the original list.
Interaction potential
Assumption: relevant documents are very similar to each other, and less similar to irrelevant documents.
• Assesses how much support same-valued documents give to keeping the current value, and how much support opposite-valued documents give to changing to the contrary value.
Relevance feedback
• Used as a seed for building the initial configuration of the MRF.
  – We set fi = 1 for relevance-feedback documents and fj = 0 for the rest.
  – The MRF starts the energy minimization process knowing which documents are potentially relevant to the query.
• The inference process consists of identifying further relevant documents in the list by propagating the user's relevance feedback information through the MRF.
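The slides do not spell out the exact potential functions, so the sketch below assumes simple illustrative forms: the observation term penalizes labeling a document contrary to its query similarity, and the interaction term penalizes giving different labels to similar documents. ICM then greedily flips each label to the locally minimal-energy value until nothing changes:

```python
def icm(query_sim, doc_sim, seeds, beta=0.5, iters=10):
    """Iterated Conditional Modes over a fully connected document MRF.

    query_sim: per-document similarity to the query, in [0, 1]
    doc_sim:   pairwise document-similarity matrix, in [0, 1]
    seeds:     indices of relevance-feedback documents (initially f_i = 1)
    """
    n = len(query_sim)
    f = [1 if i in seeds else 0 for i in range(n)]  # initial configuration
    for _ in range(iters):
        changed = False
        for i in range(n):
            def energy(label):
                obs = abs(label - query_sim[i])          # observation term
                inter = sum(doc_sim[i][j] * abs(label - f[j])
                            for j in range(n) if j != i)  # interaction term
                return obs + beta * inter
            best = min((0, 1), key=energy)
            if best != f[i]:
                f[i], changed = best, True
        if not changed:                                   # converged
            break
    return f
```

With one feedback document and a second document highly similar to it, the similar document is pulled into the relevant class, while a dissimilar, low-similarity document stays irrelevant.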
Evaluation
• We employed the GeoCLEF document collection, composed of news articles from the years 1994 and 1995.
• 100 topics from GeoCLEF 2005 to 2008.
• We evaluated results using Mean Average Precision (MAP) and precision at N (P@N).
• Initial results were produced by the vector space model configured in Lemur using a TF-IDF weighting scheme.
Results
Final comments
• Recent results have shown that traditional IR systems are able to retrieve the majority of the relevant documents for most queries, but they have severe difficulties generating a pertinent ranking of them.
  – This indicates that the thematic part is the most important from a recall perspective.
  – It suggests using geo-processing as a post-retrieval stage (re-ranking).
  – It suggests using user interfaces for carrying out the interpretation of the query <relation>.
Question Answering
Content of the section
• Definition of the task
• General architecture of a QA system
  – Question classification
  – Passage retrieval
  – Answer extraction and ranking
• Answer validation
• Multilingual question answering
Question answering
• Due to the great amount of documents available online, better retrieval methods are required.
• Question Answering (QA) systems are applications whose aim is to provide inexperienced users with flexible access to information.
• These systems allow users to write a query in natural language and to obtain not a set of documents that contain the answer, but the concise answer itself.
Input and output
• In summary, the goal of a QA system is to find the answer to an open-domain question in a large document collection.
  – Input: questions (instead of keyword-based queries)
  – Output: answers (instead of documents)

Given a question like "Where is the Popocatepetl located?", a QA system must respond "Mexico", instead of just returning a list of documents related to the volcano.
Related technologies
• Information Retrieval
  – Retrieves documents relevant to a user query from a document collection.
    • Queries are expressed as a set of keywords.
• Information Extraction
  – Template filling from text (i.e., filling slots in a database from sub-segments of text).
    • Slots in templates are previously defined.
• Relational QA
  – Translates questions into relational DB queries.
    • Answers are extracted from a given database.
Complexity of the task
Current work
• Questions about simple facts
  – Names, dates, quantities, lists of instances, definitions of terms and persons.
• Answers extracted from one single document
  – Direct answers; no (or little) inference required.
• Mainly using one single document collection
  – Multilingual QA is the exception, but in the end most methods use only one collection.
Examples of factoid questions
• When did the reunification of East and West Germany take place? → 1989
• Who is the Prime Minister of Italy? → Silvio Berlusconi
• What is the capital of the Republic of South Africa? → Pretoria
• How many inhabitants does Sweden have? → about 8,600,000
• Who were the members of The Beatles? → John, Paul, Ringo and George

How to (automatically) extract the answer to these questions?
Answering simple questions is not so easy
• Sir Winston Leonard Spencer-Churchill (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War… son of Lord Randolph Churchill, a politician, and Lady Randolph Churchill, daughter of American millionaire Leonard Jerome.
  – When was Winston Churchill born?
  – Where was Winston Churchill born?
  – What is the name of his mother/father?
  – What is the name of his grandparent?
What is the problem with these questions?
Typical QA pipeline
[Diagram: Question → Question Analysis (produces query + answer type) → Search Engine over the Document Collection → Passage Extractor → Answer Selector → Answers, with NLP/knowledge resources supporting each stage]
• Several systems retrieve passages directly from the collection.
• Most systems return a list of candidate answers together with a support passage.
How to carry out these subtasks?
Question analysis
• This module processes the question, analyzes the question type, and produces a set of keywords for retrieval.
  – Depending on the answer extraction strategies, it is sometimes necessary to perform syntactic and semantic analysis of the questions.
• One problem: how to define a taxonomy of questions?
  – Most systems use a simple classification including main types and some subtypes.
Question classification
How to achieve this classification? Types and subtypes in one single step?
• When did the reunification of East and West Germany take place? → DATE
• Who is the Prime Minister of Italy? → PERSON
• What is the capital of the Republic of South Africa? → PLACE (subtype: city)
• Where is the Popocatepetl located? → PLACE (subtype: country)
• How many inhabitants does Sweden have? → QUANTITY
• Who was Gennady Lyachin? → DEFINITION
Main approaches
• Using handcrafted rules
  – Several rules for each kind of query
• By means of a supervised approach
  – The main approach considers only surface text features (bags of words and bags of n-grams)
    • The most informative is the wh-word
  – Questions are represented as binary feature vectors
  – Best results are obtained using SVMs
Current classification accuracy is up to 90%.
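A toy version of the handcrafted-rules approach, keyed mostly on the wh-word; the rule list is hypothetical, and real systems use many more rules plus a supervised back-off:

```python
# Ordered (trigger, type) rules; the first matching prefix wins.
RULES = [
    ("when", "DATE"),
    ("where", "PLACE"),
    ("how many", "QUANTITY"),
    ("how much", "QUANTITY"),
    ("who was", "DEFINITION"),   # "Who was X?" often asks for a definition
    ("who", "PERSON"),
]

def classify(question):
    q = question.lower()
    for trigger, qtype in RULES:
        if q.startswith(trigger):
            return qtype
    return "OTHER"
```

Note the rule ordering matters: "who was" must be tried before the more general "who".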
Passage retrieval
• An important component common to many question answering systems.
• It processes sets of documents (maybe the entire collection) and returns ranked lists of passages scored with respect to the query terms.
• Two main issues:
  – Defining the passages
  – Measuring their relevance to a given question
How to achieve these two tasks?
Dividing documents into passages
• Discourse models
  – Use the structural properties of the documents, such as sentences or paragraphs.
    • Produce paragraphs of very different lengths.
• Semantic models
  – Divide each document into semantic pieces according to the different topics in the document.
    • Require automatic topic segmentation; also produce passages of different lengths.
• Window models
  – Use windows of a fixed size (usually a number of terms) to determine passage boundaries.
    • May divide a relevant section into two passages.
Approaches for passage retrieval
• Overlap-based passage retrieval
  – Counts common terms between the query and the passage.
• Density-based passage retrieval
  – Favors query terms that appear close together.
• Some works have also explored:
  – Using the order of the words
  – Combining term overlap and question reformulations
  – Calculating similarity at the syntactic level using edit distance between trees
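A hedged sketch of a density-style score: it rewards passages that cover many query terms and whose matched terms sit close together. The formula is illustrative, not any specific published measure:

```python
def density_score(passage, query_terms):
    """Coverage of query terms, discounted by the span they occupy."""
    tokens = passage.lower().split()
    positions = [i for i, t in enumerate(tokens) if t in query_terms]
    if not positions:
        return 0.0
    matched = len({tokens[i] for i in positions})   # distinct terms matched
    span = positions[-1] - positions[0] + 1          # width of the match区
    return (matched / len(query_terms)) * (matched / span)
```

A compact passage containing all query terms thus outranks a passage where the same terms are scattered.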
Some final comments
• Boolean querying schemes perform well in the question answering task.
• The performance differences between various passage retrieval algorithms vary with the choice of document retriever.
  – This suggests significant interactions between document retrieval and passage retrieval.
• The best algorithms in past evaluations employ density-based measures for scoring query terms.

S. Tellex, B. Katz, J. Lin, A. Fernandes and G. Marton. Quantitative evaluation of passage retrieval algorithms for question answering. Proc. of SIGIR '03, Toronto, Canada, 2003, pp. 41-47.
Answer extraction
• This module performs a detailed analysis of passages to locate the question's answer.
  – Usually it produces a list of candidate answers and ranks them according to some scoring function.
• Most methods consider:
  – The expected answer type
  – The level of overlap between question and passage
  – The redundancy of the answer across passages
  – Restrictions on the semantic and syntactic roles of the answer
Difficulties
• When did Nixon visit China?
  – "Richard Nixon's trip to China in February 1972 was a critically important…"
  – "On Feb. 21, 1972, U.S. President Richard Nixon arrived in Beijing."
  – "In 1972, a quiet, unassuming China was visited by an American president, Richard Nixon."
• It is important to do morphological and syntactic analysis, to have world knowledge, and to perform some kind of normalization of answers.
Evaluation of QA
• Accuracy: indicates the percentage of correctly answered questions.
  – This measure is calculated as the fraction of correct answers plus correct NILs with respect to the total number of questions.
• MRR (Mean Reciprocal Rank): evaluates the list of candidate answers. It is the average of the reciprocal ranks of the answers over a sample of queries.
Multi-stream QA
• There are several QA approaches, and most of them are complementary. For instance, in the Portuguese QA track at CLEF 2008:
  – The system "diue081" correctly answered 89% of the definition questions, but could answer only 35% of the factual questions.
  – The combination of the correct answers from all nine participating systems outperformed the best individual result for factual questions by 49%.
• This indicates that a pertinent combination of various systems should make it possible to improve on the individual results.
Traditional approaches
• The challenge is to select the correct answer for a given question by combining the evidence from different input systems.
  – Dark Horse approach: considers the confidence of the systems; each system has different confidences associated with factual and definition questions.
  – Answer Chorus approach: relies on answer redundancy; it selects as the final response the answer with the highest frequency across streams.
  – Web Chorus approach: uses information from the Web to evaluate the relevance of candidate answers; it selects the answer with the greatest number of Web pages containing the answer terms along with the question terms.
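The Answer Chorus approach reduces to frequency voting over normalized answer strings; the stream outputs below are hypothetical:

```python
from collections import Counter

def answer_chorus(stream_answers):
    """Pick the most frequent answer across streams, after normalization."""
    counts = Counter(a.strip().lower() for a in stream_answers)
    return counts.most_common(1)[0][0]
```

Normalizing case and whitespace before counting matters, since different streams rarely emit byte-identical answer strings.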
Hybrid supervised approach
• Combines features from the traditional and textual-entailment approaches that describe:
  – The redundancy of answers across streams
  – The compatibility between question and answer types
  – The overlap and non-overlap information between the question-answer pair and the support text
• As any other supervised approach, it needs a tagged training data set.
• Best results are obtained using SVMs.
Multilingual QA
• In a multilingual scenario, it is expected that QA systems will be able to:
  – Answer questions formulated in several languages
  – Look for answers in a number of collections in different languages
• Additional issues due to the language barrier:
  – Translation of incoming questions into all target languages
  – Combination of relevant information extracted from different languages
Translation of questions
• Most current systems translate questions into the documents' language.
• This solution is very intuitive and seems effective, but it is too sensitive to translation errors.
  – Translation errors cause a drop in answer accuracy.
• Current work focuses on:
  – Performing triangulated translation using English as a pivot language
  – Combining the capacities of several machine translation systems
  – Selecting the best translation for the current collection
Combining multilingual information
• The goal is to integrate information obtained from different languages into one single ranked list.
• Most systems rely on the translation of passages or answers into a common language.
• Two main approaches:
  – Combining passages
    • Implement data fusion techniques from IR
  – Combining answers
    • Use architectures from multi-stream QA
Summary of results from QA@CLEF
Summary of results from TREC 2007