SEARCHING QUESTION AND ANSWER ARCHIVES
Dr. Jiwoon Jeon
Presented by Charanya Venkatesh Kumar
Discussion
Current Information Retrieval systems?
OVERVIEW
  Introduction
  Q&A Retrieval
  Test Collections
  Translation Based Q&A retrieval framework
  Learning word-to-word translations
INTRODUCTION
  Q&A Retrieval problem
  Challenges
    Semantically similar questions
      Problem: Word mismatch problem
      Solution: Machine translation-based information retrieval model
    Quality of the answers
      Problem: Many answers to a given question
      Solution: Answer quality prediction technique
What is New?
  New type of information system
  New translation-based retrieval model
  New document quality estimation method
  Integration of advances in multiple research areas
  New paraphrase generation method
  Utilizing the Web as a resource for retrieval
Q & A RETRIEVAL
  Question & Answer Archives
    Websites with FAQ
    Community based question answering services
  Task Definition
Q & A Retrieval (Contd..)
  Advantages
    Handle natural language questions
    Return answers instead of relevant documents
  Disadvantages
    Can answer only previously answered questions
Q & A RETRIEVAL SYSTEM ARCHITECTURE
CHALLENGES
  Finding relevant Question & Answer pairs
    Importance of question parts
    Word mismatch problem
  Estimating answer quality
    Importance
TEST COLLECTIONS
  Components:
    Set of documents
    Set of information needs (queries)
    Set of relevance judgments
  Pooling Method
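The pooling method named above is usually understood as follows: the top-ranked results of several retrieval systems are merged into one pool, and only the pooled items are judged for relevance. A minimal sketch, assuming each run is a ranked list of document IDs; the function name `build_pool` and the `depth` cutoff are illustrative and do not come from the slides.

```python
def build_pool(ranked_runs, depth=10):
    """Union of the top-`depth` results from every run (hypothetical helper).

    ranked_runs: list of ranked lists of document IDs, one list per system.
    Only the documents in the returned pool would be shown to assessors.
    """
    pool = set()
    for run in ranked_runs:
        pool.update(run[:depth])
    return pool


# Two toy runs; with depth=2 only the first two results of each run are pooled.
runs = [["d1", "d2", "d3"], ["d3", "d4", "d5"]]
pool = build_pool(runs, depth=2)
```

Documents outside the pool are normally treated as non-relevant, which is why pools are built from several different systems.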
WONDIR COLLECTION
Earliest community based QA service in the US.
1 million question and answer pairs used from this service
Average question length = 27 words
Average answer length = 28 words
Examples
Queries
  Closed-class questions that ask for fact-based short answers.
  E.g.: Where is Charlotte located?
Relevance Judgment
  220 relevant Q&A pairs for 50 queries, found using the pooling method.
  Relevance Judgment Criteria
WebFAQ COLLECTION by Jijkoun and de Rijke
Collection of FAQs gathered with web crawlers and made public for research purposes.
Found web pages that contain the word “FAQ”.
Used heuristic methods to automatically extract question and answer pairs from the web pages.
NAVER COLLECTION
Leading portal site in South Korea
Community-based answering service
Collection A: Category information, to test category-specific translations
Collection B: Non-textual information, to build the answer quality prediction technique
Naver Collection (Contd..)
  Question: Title & Body
  Naver Test Collection A
  Naver Test Collection B
  Relevance:
    Question semantically related to query and
    Question contains all query terms
    Q&A pair was clicked multiple times for the query.
Comparison of test Collections
Translation Based Q&A Retrieval framework
  Use of machine translation techniques for information retrieval
  Word mismatch problem
  Translation based approach
IBM Statistical Machine translation Models
Do not require any linguistic knowledge of the source or target language.
Exploit only co-occurrence statistics of terms in the training data.
IBM Models
  Model 1
    Treats every possible word alignment equally.
  Model 2
    Assumes only the positions of terms are related to the word alignment.
  Model 3
    The first term and the second term generated from the same term are independent.
IBM Models (Contd..)
  Model 4
    First-order alignment model.
    Every word is dependent only on the previous aligned word.
  Model 5
    Reformulation of Model 4.
Advantages of Model 1
Efficient implementation is possible using a form of query expansion.
Performance gain of using low level translation models is high.
Can be easily integrated into the query likelihood language model.
IBM Model 1 Equation
The probability that a query Q of length m is the translation of a document D (of length n) is given as
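For reference, IBM Model 1 is commonly written in the following form. This is the textbook formulation rather than a copy of the slide; the symbols $q_i$ (the i-th query word), $d_j$ (the j-th document word), $d_0$ (an empty "null" word) and the constant $\epsilon$ are notation assumed here.

```latex
P(Q \mid D) \;=\; \frac{\epsilon}{(n+1)^{m}} \prod_{i=1}^{m} \sum_{j=0}^{n} P(q_i \mid d_j)
```

Each query word is explained by summing its translation probability over every document word, with all alignments weighted equally, which matches the description of Model 1 given earlier.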
IBM Model 1 Equation
Translation based Language Models
  A language model is a mechanism for generating text.
  Unigram language model
    Assumes each word is generated independently.
    Concerns only the probability of sampling a single word.
Language modeling approach to IR
  With a maximum likelihood estimator, unseen words in a document have zero probability.
  Smoothing: transfers some probability mass from the seen words to the unseen words.
  Dirichlet smoothing: good performance and cheap computational cost.
Language modeling approach to IR (Contd..)
The ranking function for the query likelihood language model with Dirichlet smoothing can be written as
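A minimal sketch of query likelihood scoring with Dirichlet smoothing in its standard textbook form, P(w|D) = (tf(w,D) + mu * P(w|C)) / (|D| + mu), summed in log space over the query words. The function name, the token-list inputs and the default `mu` are assumptions made for illustration; the slide's own notation may differ.

```python
import math
from collections import Counter


def ql_dirichlet_log_score(query, doc, coll_tf, coll_len, mu=2000.0):
    """log P(Q|D) under a unigram model with Dirichlet smoothing.

    query, doc : lists of tokens
    coll_tf    : term -> frequency in the whole collection
    coll_len   : total number of tokens in the collection
    mu         : smoothing parameter (the default here is an assumed value)
    """
    doc_tf = Counter(doc)
    score = 0.0
    for w in query:
        p_coll = coll_tf.get(w, 0) / coll_len          # P_ml(w|C)
        p = (doc_tf[w] + mu * p_coll) / (len(doc) + mu)
        if p == 0.0:                                   # word unseen even in the collection
            return float("-inf")
        score += math.log(p)
    return score


docs = {"d1": ["reboot", "router"], "d2": ["pizza", "oven"]}
coll_tf = Counter(tok for d in docs.values() for tok in d)
coll_len = sum(coll_tf.values())
s1 = ql_dirichlet_log_score(["reboot"], docs["d1"], coll_tf, coll_len, mu=10.0)
s2 = ql_dirichlet_log_score(["reboot"], docs["d2"], coll_tf, coll_len, mu=10.0)
```

Because of smoothing, `d2` still receives a finite score even though it never contains the query word; it simply ranks below `d1`.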
IBM Model 1 vs. Query Likelihood
Comparable components in the two models
Self Translation Model
  Every word has some probability of translating to itself.
  This probability cannot be 1.
  If it is too low, retrieval performance deteriorates.
TransLM
Final ranking function looks like
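One plausible reading of a translation-based language model is sketched here: the document term probability in the Dirichlet-smoothed query likelihood is replaced by a translation mixture, sum over t of P(w|t) * P_ml(t|D). This exact form is an assumption for illustration and may differ from the formula on the slide. The `trans` dictionary, keyed by (target word, source word), is hypothetical and is expected to contain self-translation entries such as ("reboot", "reboot").

```python
import math
from collections import Counter


def translm_log_score(query, doc, trans, coll_tf, coll_len, mu=2000.0):
    """log P(Q|D) with a translation mixture in place of P_ml(w|D) (assumed form).

    trans[(w, t)] = P(w|t): probability that document word t translates to query word w.
    """
    doc_tf = Counter(doc)
    n = len(doc)
    score = 0.0
    for w in query:
        # sum_t P(w|t) * P_ml(t|D), taken over the distinct terms of the document
        p_tr = sum(trans.get((w, t), 0.0) * tf / n for t, tf in doc_tf.items()) if n else 0.0
        p_coll = coll_tf.get(w, 0) / coll_len
        p = (n * p_tr + mu * p_coll) / (n + mu)
        if p <= 0.0:
            return float("-inf")
        score += math.log(p)
    return score


trans = {("restart", "reboot"): 0.3, ("restart", "restart"): 0.5, ("reboot", "reboot"): 0.5}
coll_tf = {"restart": 1, "reboot": 1, "router": 2, "pizza": 1}
coll_len = 5
t1 = translm_log_score(["restart"], ["reboot", "router"], trans, coll_tf, coll_len, mu=10.0)
t2 = translm_log_score(["restart"], ["pizza", "router"], trans, coll_tf, coll_len, mu=10.0)
```

In this toy example the first document never contains "restart", yet it outranks the second because "reboot" translates to "restart" with non-zero probability, which is how the word mismatch problem is addressed.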
Efficiency Issues and Implementation of TransLM
  Flipped Translation Tables
  Term-at-a-time Algorithm
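The two implementation ideas above can be sketched as follows, under one interpretation: a "flipped" table is indexed by the target (query) word, so all source terms that translate into it can be looked up at once, and scores are then accumulated per document by walking one posting list at a time. The data layouts and function names are hypothetical, not taken from the slides.

```python
from collections import defaultdict


def flip_translation_table(table):
    """table[t][w] = P(w|t)  ->  flipped[w][t] = P(w|t)."""
    flipped = defaultdict(dict)
    for t, targets in table.items():
        for w, p in targets.items():
            flipped[w][t] = p
    return dict(flipped)


def translation_scores(query_word, flipped, postings, doc_len):
    """Term-at-a-time accumulation of sum_t P(w|t) * tf(t, D) / |D| per document.

    postings[t] = {doc_id: term frequency of t in that document}
    """
    acc = defaultdict(float)
    for t, p in flipped.get(query_word, {}).items():
        for doc_id, tf in postings.get(t, {}).items():
            acc[doc_id] += p * tf / doc_len[doc_id]
    return dict(acc)


table = {"reboot": {"restart": 0.3, "reboot": 0.7},
         "router": {"modem": 0.4, "router": 0.6}}
flipped = flip_translation_table(table)
postings = {"reboot": {"d1": 1}, "router": {"d1": 1, "d2": 2}}
doc_len = {"d1": 2, "d2": 4}
scores = translation_scores("restart", flipped, postings, doc_len)
```

Only documents that contain at least one source term of the query word are touched, which resembles query expansion over an inverted index.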
Properties of Word Relationships
  Not symmetric
  Not fixed: change depending on retrieval or translation tasks.
  Must be given as probability values.
Training Sample Generation
  Key Idea
    If two answers are very similar, then the corresponding questions are semantically similar.
  Similarity Measures
    Cosine Similarity
    Query Likelihood scores between two answers (LM SCORE)
    LM-HRANK
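The key idea can be sketched with the first of the similarity measures, cosine similarity. This is a minimal illustration that assumes raw term-frequency vectors and an arbitrary threshold of 0.8; neither choice comes from the slides, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two token lists using raw term frequencies."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def similar_question_pairs(qa_pairs, threshold=0.8):
    """Pair up questions whose answers are at least `threshold` similar."""
    pairs = []
    for (q1, a1), (q2, a2) in combinations(qa_pairs, 2):
        if cosine(a1, a2) >= threshold:
            pairs.append((q1, q2))
    return pairs


qa = [
    (["how", "restart", "router"], ["unplug", "it", "wait", "plug", "in"]),
    (["router", "reboot", "steps"], ["unplug", "it", "wait", "plug", "in"]),
    (["best", "pizza"], ["try", "the", "oven"]),
]
samples = similar_question_pairs(qa, threshold=0.8)
```

The resulting question pairs would serve as training samples in which "restart" and "reboot" co-occur, even though neither question contains both words. Comparing every pair is quadratic in the archive size, so a real system would need an index-based shortcut.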
Word Relationship Types
  P(Q|A): Source: Answer; Target: Question
  P(A|Q): Source: Question; Target: Answer
  P(Q|Q)
  P(Q<->Q)
EM Algorithm
  Find word relationships that maximize the likelihood of sampling the target text from the source text in training samples.
EM Algorithm (Contd..)
The translation probability from a source word t to a target word w is given as
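A compact sketch of the standard EM procedure for IBM Model 1 translation probabilities. In the E-step, each target word w in a training pair distributes one unit of count over the source words t of that pair in proportion to the current P(w|t); in the M-step, the counts are normalized per source word. The null word is omitted for brevity, and the function name and iteration count are assumptions rather than details from the slides.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate P(w|t) from (source_tokens, target_tokens) training pairs.

    Returns a dict-like mapping (w, t) -> P(w|t). No null word is used (simplification).
    """
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: uniform)          # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)               # expected count of (w, t)
        total = defaultdict(float)               # expected count of t
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(prob[(w, t)] for t in src)
                for t in src:
                    frac = prob[(w, t)] / norm   # E-step: posterior of aligning w to t
                    count[(w, t)] += frac
                    total[t] += frac
        # M-step: renormalize so that sum_w P(w|t) = 1 for every source word t
        prob = defaultdict(float, {(w, t): c / total[t] for (w, t), c in count.items()})
    return prob


# Toy samples: source = answer-side words, target = question-side words.
pairs = [(["a", "b"], ["x", "y"]), (["a"], ["x"])]
p = train_model1(pairs, iterations=10)
```

In the toy data, the second pair pins "x" to "a", so after a few iterations "y" is increasingly attributed to "b" even though the two were never observed alone together.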
Examples
Examples (Contd..)
SUMMARY
  Introduction
  Q&A Retrieval
  Test Collections
  Translation Based Q&A retrieval framework
  Learning word-to-word translations
Coming Up Next…
  Estimating Answer Quality
  Experiments