
SEARCHING QUESTION AND ANSWER ARCHIVES

Dr. Jiwoon Jeon

Presented by CHARANYA VENKATESH KUMAR

Discussion

Current Information Retrieval systems?

OVERVIEW

Introduction
Q&A Retrieval
Test Collections
Translation Based Q&A Retrieval Framework
Learning word-to-word translations

INTRODUCTION

Q&A Retrieval problem
Challenges
  Semantically similar questions
    Problem: Word mismatch problem
    Solution: Machine translation-based information retrieval model
  Quality of the Answers
    Problem: Many answers to a given question
    Solution: Answer Quality Prediction Technique

What is New?

New Type of Information System
New Translation-based Retrieval Model
New Document Quality Estimation Method
Integration of Advances in Multiple Research Areas
New Paraphrase Generation Method
Utilizing Web as a Resource for Retrieval

Q & A RETRIEVAL

Question & Answer Archives
  Websites with FAQ
  Community based question answering services
Task Definition

Q & A Retrieval (Contd..)

Advantages
  Handle natural language questions
  Return answers instead of relevant documents
Disadvantages
  Can answer only previously answered questions

Q & A RETRIEVAL SYSTEM ARCHITECTURE

CHALLENGES

Finding relevant Question & Answer Pairs
  Importance of question parts
  Word mismatch problem
Estimating Answer Quality
  Importance

TEST COLLECTIONS

Components:
  Set of documents
  Set of information needs (queries)
  Set of relevance judgments
Pooling Method
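The pooling method can be sketched as follows. This is a minimal illustration, not the procedure used to build any of the collections in these slides: the system names, document IDs, and pool depth are all hypothetical. The idea is that only the union of the top-ranked results from several retrieval systems is sent to human judges, instead of the whole collection.

```python
def build_pool(runs, depth):
    """Return the union of the top-`depth` results from every system's ranking."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

# Hypothetical ranked lists from three retrieval systems for one query.
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d2", "d5", "d1", "d6"],
    "system_c": ["d7", "d2", "d8", "d9"],
}

# Only the pooled items are judged for relevance; everything outside
# the pool is treated as non-relevant.
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

A deeper pool finds more relevant items but costs more judging effort.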

WONDIR COLLECTION

Earliest community based QA service in the US.

1 million question and answer pairs used from this service

Average question length = 27 words

Average answer length = 28 words

Examples

Queries
  Closed-class questions that ask for fact-based short answers.
  E.g.: Where is Charlotte located?
Relevance Judgment
  220 relevant Q&A pairs for 50 queries, found using the pooling method.
  Relevance Judgment Criteria

WebFAQ COLLECTION by Jijkoun and de Rijke

Collection of FAQs gathered using web crawlers and made public for research purposes.

Found web pages that contain the word “FAQ”.

Used heuristic methods to automatically extract question and answer pairs from the web pages.

NAVER COLLECTION

Leading portal site in South Korea
Community-based answering service
Collection A: Category Information – to test category-specific translations
Collection B: Non-Textual Information – to build the answer quality prediction technique

Naver Collection (Contd..)

Question – Title & Body
Naver Test Collection A
Naver Test Collection B
Relevance:
  Question semantically related to query, and
  Question contains all query terms
  Q&A pair was clicked multiple times for the query.

Comparison of test Collections

Translation Based Q&A Retrieval framework

Use of Machine Translation technique for information retrieval

Word mismatch problem
Translation based approach

IBM Statistical Machine translation Models

Do not require any linguistic knowledge of the source or target language.

Exploits only co-occurrence statistics of terms in training data.

IBM Models

Model 1
  Treats every possible word alignment equally
Model 2
  Assumes only positions of terms are related to the word alignment
Model 3
  The first term and the second term generated from the same term are independent

IBM Models (Contd..)

Model 4
  First order alignment model
  Every word is dependent only on the previous aligned word.
Model 5
  Reformulation of Model 4

Advantages of Model 1

Efficient implementation is possible using a form of query expansion.

Performance gain of using low level translation models is high.

Can be easily integrated into the query likelihood language model.

IBM Model 1 Equation

The probability that a query Q of length m is the translation of a document D (of length n) is given as
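The equation itself did not survive on this slide. For reference, the standard IBM Model 1 form, written in the slide's notation (query Q of length m, document D of length n), is sketched below. The symbols are assumptions and may differ from the original slide: \epsilon is a normalization constant, q_i is the i-th query word, d_j is the j-th document word, d_0 is an added null word, and P(q_i | d_j) is the word-to-word translation probability.

```latex
P(Q \mid D) = \frac{\epsilon}{(n+1)^{m}} \prod_{i=1}^{m} \sum_{j=0}^{n} P(q_i \mid d_j)
```

Because every alignment is weighted equally, the sum over alignments factorizes into a product over query words, which is what makes Model 1 cheap to compute.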

Translation based Language Models

Language model is a mechanism for generating text.

Unigram language model
  Assumes each word is generated independently
  Concerns only probabilities of sampling a single word.

Language modeling approach to IR

Under maximum likelihood estimation, words unseen in a document have zero probability.
Smoothing:
  Transfers some probability mass from the seen words to the unseen words.
  Dirichlet smoothing – good performance and cheap computational cost.

Language modeling approach to IR (Contd..)

The ranking function for the query likelihood language model with Dirichlet smoothing can be written as
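The formula is missing from the slide, so the sketch below uses the textbook form of Dirichlet smoothing, P(w|D) = (tf(w,D) + mu * P(w|C)) / (|D| + mu), and ranks by the log of the product over query words. The function name, the toy documents, and the value of mu are assumptions for illustration only.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, coll_tf, coll_len, mu=2000.0):
    """log P(Q|D) under a unigram model with Dirichlet smoothing."""
    doc_tf = Counter(doc)
    score = 0.0
    for w in query:
        p_coll = coll_tf.get(w, 0) / coll_len          # background model P(w|C)
        p = (doc_tf[w] + mu * p_coll) / (len(doc) + mu)
        if p == 0.0:                                   # word unseen in the whole collection
            return float("-inf")
        score += math.log(p)
    return score

# Toy collection of two questions (hypothetical data).
docs = {
    "q1": "where is charlotte located".split(),
    "q2": "how to bake bread".split(),
}
coll_tf = Counter(w for toks in docs.values() for w in toks)
coll_len = sum(coll_tf.values())

query = ["where", "charlotte"]
scores = {d: query_log_likelihood(query, toks, coll_tf, coll_len, mu=10.0)
          for d, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Thanks to smoothing, q2 still receives a finite score even though it contains neither query word.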

IBM Model 1 vs. Query Likelihood

Comparable components in the two models

Self Translation Model

Every word has some probability of translating to itself.
  This probability cannot be 1.
  If it is too low, retrieval performance deteriorates.

TransLM

Final ranking function looks like
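The ranking function is not recoverable from the slide. The sketch below follows one common formulation of a translation-based language model, in which the document model mixes the maximum likelihood estimate with a translation term and is then Dirichlet-smoothed against the collection. Whether this matches the slide exactly is an assumption, and the parameter names (beta, mu), the translation table, and the toy data are hypothetical.

```python
import math
from collections import Counter

def translm_log_score(query, doc, trans, coll_prob, mu=10.0, beta=0.5):
    """log P(Q|D) where P(w|D) mixes exact matching with word translation.

    trans[t][w] is P(w|t): the probability that source word t translates to w.
    """
    tf = Counter(doc)
    n = len(doc)
    score = 0.0
    for w in query:
        p_ml = tf[w] / n                                # exact-match estimate
        p_tr = sum(trans.get(t, {}).get(w, 0.0) * c / n # sum_t P(w|t) * P_ml(t|D)
                   for t, c in tf.items())
        p_mx = (1 - beta) * p_ml + beta * p_tr
        p = (n / (n + mu)) * p_mx + (mu / (n + mu)) * coll_prob.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score

# Hypothetical translation table; "pc" keeps most mass on itself (self translation).
trans = {"pc": {"pc": 0.6, "computer": 0.4}}
coll_prob = {"computer": 0.01}

doc1 = ["fix", "my", "pc"]
doc2 = ["fix", "my", "bike"]
s1 = translm_log_score(["computer"], doc1, trans, coll_prob)
s2 = translm_log_score(["computer"], doc2, trans, coll_prob)
print(s1 > s2)
```

Neither document contains "computer", but doc1 is ranked higher because "pc" translates to it, which is how the model addresses the word mismatch problem.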

Efficiency Issues and Implementation of TransLM

Flipped Translation Tables
Term-at-a-time Algorithm
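The slide gives only the names of these two techniques, so the following is a rough sketch of how they could fit together, under stated assumptions. A translation table indexed by source word is "flipped" so it can be looked up by query (target) word, and the translation term is then accumulated one posting list at a time. The data structures (`trans`, `postings`, `doc_len`) and function names are hypothetical.

```python
from collections import defaultdict

def flip_translation_table(trans):
    """Turn trans[source][target] = P(target|source) into flipped[target][source]."""
    flipped = defaultdict(dict)
    for source, targets in trans.items():
        for target, p in targets.items():
            flipped[target][source] = p
    return flipped

def translation_mass(query, flipped, postings, doc_len):
    """Accumulate sum_t P(w|t) * tf(t, D) / |D| per query word and document.

    Each source term's posting list is processed completely before moving
    to the next term (term-at-a-time), with per-document accumulators.
    """
    acc = {w: defaultdict(float) for w in query}
    for w in query:
        for t, p in flipped.get(w, {}).items():
            for doc_id, tf in postings.get(t, {}).items():
                acc[w][doc_id] += p * tf / doc_len[doc_id]
    return acc

# Hypothetical toy data.
trans = {
    "pc": {"pc": 0.6, "computer": 0.4},
    "laptop": {"laptop": 0.7, "computer": 0.3},
}
postings = {"pc": {"d1": 1}, "laptop": {"d2": 2}}   # term -> {doc: term frequency}
doc_len = {"d1": 4, "d2": 4}

flipped = flip_translation_table(trans)
acc = translation_mass(["computer"], flipped, postings, doc_len)
print(dict(acc["computer"]))
```

With the flipped table, only the source words that can actually translate to a query word are visited, so the cost resembles query expansion.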

Properties of Word Relationships

Not symmetric
Not fixed
  Change depending on retrieval or translation tasks.
Must be given as probability values.

Training Sample Generation

Key Idea
  If two answers are very similar, then the corresponding questions are semantically similar.
Similarity Measures
  Cosine Similarity
  Query Likelihood scores between two answers (LM SCORE)
  LM-HRANK
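The key idea can be illustrated with the simplest of the listed measures, cosine similarity. This is a minimal sketch on raw term counts: the toy Q&A pairs, the whitespace tokenization, and the threshold are assumptions, and LM SCORE and LM-HRANK are not shown.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a_tokens, b_tokens):
    """Cosine similarity between two token lists using raw term counts."""
    va, vb = Counter(a_tokens), Counter(b_tokens)
    dot = sum(c * vb[w] for w, c in va.items())
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def similar_question_pairs(qa_pairs, threshold):
    """Pair up questions whose answers are at least `threshold` similar."""
    pairs = []
    for (q1, a1), (q2, a2) in combinations(qa_pairs, 2):
        if cosine(a1.split(), a2.split()) >= threshold:
            pairs.append((q1, q2))
    return pairs

# Hypothetical archive entries: (question, answer).
qa_pairs = [
    ("where is charlotte located", "charlotte is a city in north carolina"),
    ("which state is charlotte in", "charlotte is in north carolina"),
    ("how do i bake bread", "mix flour water and yeast then bake"),
]
training_pairs = similar_question_pairs(qa_pairs, threshold=0.8)
print(training_pairs)
```

The resulting question pairs serve as the parallel text from which word-to-word translation probabilities are learned.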

Word Relationship Types

P(Q|A) Source – Answer ; Target – Question

P(A|Q) Source – Question ; Target – Answer

P(Q|Q)
P(Q<->Q)

EM Algorithm

Find word relationships that maximize the likelihood of sampling the target text from the source text in training samples.

EM Algorithm (Contd..)

The translation probability from a source word t to a target word w is given as
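The update formula is missing from the slide. As a stand-in, here is a minimal sketch of the standard EM procedure for IBM Model 1, which estimates P(w|t) from pairs of source and target texts. It omits the null word, and the function name, toy pairs, and iteration count are assumptions, not the presenter's setup.

```python
from collections import defaultdict

def train_model1(pairs, iterations=20):
    """Estimate prob[t][w] = P(w|t) from (source_tokens, target_tokens) pairs."""
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: defaultdict(lambda: uniform))
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: distribute each target word over the source words in its pair.
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(prob[t][w] for t in src)
                for t in src:
                    c = prob[t][w] / norm
                    count[t][w] += c
                    total[t] += c
        # M-step: renormalize expected counts into probabilities per source word.
        prob = defaultdict(lambda: defaultdict(float))
        for t, row in count.items():
            for w, c in row.items():
                prob[t][w] = c / total[t]
    return prob

# Hypothetical pairs of semantically similar questions (source, target).
pairs = [
    ("fix pc".split(), "repair computer".split()),
    ("fix bike".split(), "repair bicycle".split()),
    ("buy pc".split(), "purchase computer".split()),
]
prob = train_model1(pairs)
best = {t: max(row, key=row.get) for t, row in prob.items()}
print(best)
```

Words that co-occur consistently across pairs, such as "fix" and "repair", end up with the highest translation probability.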

Examples

Examples (Contd..)

SUMMARY



Coming Up Next…

Estimating Answer Quality
Experiments
