21/11/2002
The Integration of Lexical Knowledge and External Resources for QA
Hui YANG, Tat-Seng CHUA
{yangh,chuats}@comp.nus.edu.sg
PRIS, School of Computing
National University of Singapore
Presentation Outline
- Introduction
- PRIS QA System Design
- Result and Analysis
- Conclusion
- Future Work
Open Domain QA
Find answers to open-domain natural language questions by searching a large collection of documents.
- Question Processing: may involve question re-formulation; used to find the answer type
- Query Expansion: to overcome the concept mismatch between the query and the information base
- Search for Candidate Answers: documents, paragraphs, or sentences
- Disambiguation: ranking (or re-ranking) of answers; location of exact answers
Current Research Trends
- Web-based QA: exploits Web redundancy; probabilistic algorithms
- Linguistic-based QA: part-of-speech tagging, syntactic parsing, semantic relations, named entity extraction, dictionaries (WordNet, etc.)
System Overview
Question Classification → Question Parsing → Query Formulation → Document Retrieval → Candidate Sentence Retrieval → Answer Extraction
[System architecture diagram: a question Q passes through Question Analysis (Question Classification, Question Parsing), Query Formulation by External Knowledge (drawing on the Web and WordNet), Document Retrieval over the relevant TREC documents, Sentence Ranking over candidate sentences, and Answer Extraction, yielding answer A. Original and expanded content words feed the retrieval stages; when no answer is found, the number of expanded content words is reduced and retrieval repeats.]
Question Classification
Based on question focus and answer type.
7 main classes: HUM, LOC, TME, NUM, OBJ, DES, UNKNOWN
- E.g. "Which city is the capital of Canada?" (Q-class: LOC)
- E.g. "Which state is the capital of Canada in?" (Q-class: LOC)
54 sub-classes. E.g. under LOC (location), we have 14 sub-classes:
- LOC_PLANET: 1, LOC_CITY: 18, LOC_CONTINENT: 3, LOC_COUNTRY: 18, LOC_COUNTY: 3, LOC_STATE: 3, LOC_PROVINCE: 2
- LOC_TOWN: 2, LOC_RIVER: 3, LOC_LAKE: 2, LOC_MOUNTAIN: 1, LOC_OCEAN: 2, LOC_ISLAND: 3, LOC_BASIC: 3
A sketch of such a classifier follows.
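As an illustration of this kind of mapping, here is a minimal keyword-pattern classifier in Python. The rules below are hypothetical examples phrased in the deck's taxonomy, not the actual PRIS rule set, which covers 7 main classes and 54 sub-classes.

```python
import re

# Hypothetical keyword patterns mapping a question onto the class taxonomy
# above; the real PRIS classifier is far more detailed.
RULES = [
    (r"\b(which|what) city\b",    "LOC_CITY"),
    (r"\b(which|what) town\b",    "LOC_TOWN"),
    (r"\b(which|what) country\b", "LOC_COUNTRY"),
    (r"\bwho\b",                  "HUM_BASIC"),
    (r"\bwhen\b|\bwhat year\b",   "TME_DATE"),
    (r"\bhow (many|much)\b",      "NUM_BASIC"),
    (r"\bwhere\b",                "LOC_BASIC"),
]

def classify(question: str) -> str:
    q = question.lower()
    for pattern, q_class in RULES:
        if re.search(pattern, q):
            return q_class
    return "UNKNOWN"

print(classify("Which city is the capital of Canada?"))  # LOC_CITY
```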
Question Parsing
Content words: q(0)
- Nouns, adjectives, numbers, some verbs
- E.g. "What mythical Scottish town appears for one day every 100 years?" Q-class: LOC_TOWN
- q(0): (mythical, Scottish, town, appears, one, day, 100, years)
Base noun phrases: n
- n: ("mythical Scottish town")
Head of the 1st noun phrase: h
- h: (town)
Quotation words: u
- E.g. "What was the original name before 'The Star Spangled Banner'?"
- u: ("The Star Spangled Banner")
A simplified sketch of this extraction follows.
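A simplified sketch of the content-word and quotation extraction, using an ad hoc stopword filter in place of the system's POS tagger and NP chunker (the stopword list is an assumption for illustration):

```python
import re

# Ad hoc stopword list standing in for POS-based content-word selection.
STOPWORDS = {"what", "which", "who", "is", "the", "a", "an", "of",
             "for", "every", "was", "before", "in", "to", "on"}

def parse_question(question: str):
    # u: quoted strings, kept verbatim
    quotes = re.findall(r'"([^"]+)"', question)
    # q(0): remaining non-stopword tokens (content words)
    stripped = re.sub(r'"[^"]+"', " ", question)
    tokens = re.findall(r"[A-Za-z0-9]+", stripped)
    content = [t for t in tokens if t.lower() not in STOPWORDS]
    return {"q0": content, "u": quotes}

q = "What mythical Scottish town appears for one day every 100 years?"
print(parse_question(q))
# {'q0': ['mythical', 'Scottish', 'town', 'appears', 'one', 'day',
#         '100', 'years'], 'u': []}
```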
Query Formulation I
- Use original content words as query to search the Web (e.g. Google)
- Find new terms which have high correlation with the original query
- Use WordNet to find the synsets and glosses of original query terms
- Rank new query terms based on both Web and WordNet
- Form new Boolean query
Query Formulation II
Original query q(0) = (q1(0), q2(0), ..., qk(0))
Use the Web as a generalized resource:
- From q(0), retrieve the top N documents
- For each qi(0) ∈ q(0), extract nearby non-trivial words (within the same sentence, or n words away) to get wi
- Rank each wik ∈ wi by computing its probability of correlation with qi(0):

  Prob(wik) = # instances of (wik ∧ qi(0)) / # instances of (wik ∨ qi(0))

- Merge all wi to form Cq for q(0)
A sketch of this correlation measure follows.
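A sketch of the correlation measure, treating each retrieved sentence as a set of words; the toy sentences below are invented for illustration:

```python
# Score a candidate expansion term w against an original query term q by
# co-occurrence in the retrieved sentences (a Jaccard-style ratio).
def correlation(sentences, w, q):
    both = sum(1 for s in sentences if w in s and q in s)   # w AND q
    either = sum(1 for s in sentences if w in s or q in s)  # w OR q
    return both / either if either else 0.0

sentences = [
    {"brigadoon", "mythical", "scottish", "town"},
    {"scottish", "highlands", "travel"},
    {"brigadoon", "musical", "film"},
]
# "brigadoon" and "scottish" co-occur in 1 of the 3 sentences
# that contain either term.
print(correlation(sentences, "brigadoon", "scottish"))  # 0.333...
```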
Query Formulation III
Use WordNet as a generalized resource:
- For each qi(0) ∈ q(0), extract terms that are lexically related to qi(0) by locating them in its gloss Gi and its synset Si
- For q(0), we get Gq and Sq
- Re-rank each wik ∈ wi by considering lexical relations: for wik ∈ Cq, if wik ∈ Gi, its weight increases by α; if wik ∈ Si, its weight increases by β (0 < α < β < 1)
- Get q(1) = q(0) + {top m terms from Cq}
A sketch of this re-ranking follows.
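A sketch of the re-ranking step using NLTK's WordNet interface (requires the nltk package and its wordnet data); the boost values stand in for α and β and are illustrative, not the authors' tuned weights:

```python
from nltk.corpus import wordnet as wn  # needs nltk.download('wordnet')

ALPHA, BETA = 0.2, 0.4  # gloss and synset boosts; assumed 0 < ALPHA < BETA < 1

def rerank(candidates, query_term):
    """candidates: dict of term -> Web correlation score; returns boosted dict."""
    gloss_words, synset_words = set(), set()
    for syn in wn.synsets(query_term):
        gloss_words.update(syn.definition().lower().split())   # words in Gi
        synset_words.update(l.lower() for l in syn.lemma_names())  # words in Si
    boosted = {}
    for term, score in candidates.items():
        if term in gloss_words:
            score += ALPHA
        if term in synset_words:
            score += BETA
        boosted[term] = score
    return boosted

print(rerank({"village": 0.3, "brigadoon": 0.5}, "town"))
```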
Document Retrieval
1,033,461 documents from:
- AP newswire, 1998-2000
- New York Times newswire, 1998-2000
- Xinhua News Agency, 1996-2000
MG Tool: Boolean search to retrieve the top N documents (N = 50)
∀ tk ∈ q(1): Q = (t1 ∧ t2 ∧ ... ∧ tn), as sketched below
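A sketch of forming the conjunctive Boolean query Q from the expanded term set q(1); the `&` operator is assumed MG-style Boolean syntax here, not verified against the MG Tool:

```python
# Join all expanded query terms into one conjunctive Boolean query.
def boolean_query(terms):
    return "(" + " & ".join(terms) + ")"

print(boolean_query(["mythical", "scottish", "town", "brigadoon"]))
# (mythical & scottish & town & brigadoon)
```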
Candidate Sentence Retrieval
∀ sentence Sentj in the top N documents, match with:
- quotation words: Wuj = % of term overlap between u and Sentj
- noun phrases: Wnj = % of phrase overlap between n and Sentj
- head of first noun phrase: Whj = 1 if there is a match and 0 otherwise
- original content words: Wcj = % of term overlap between q(0) and Sentj
- expanded content words: Wej = % of term overlap between q(1-0) and Sentj, where q(1-0) = q(1) - q(0)
Final score: Sj = Σi αi · Wij, where Σi αi = 1 and Wij ∈ {Wuj, Wnj, Whj, Wcj, Wej} (a sketch follows)
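A sketch of the final score Sj = Σi αi · Wij; the uniform α values are illustrative, not the authors' settings:

```python
import re

def overlap(terms, tokens):
    terms = {t.lower() for t in terms}
    return len(terms & tokens) / len(terms) if terms else 0.0

def sentence_score(sent, u, n_phrases, head, q0, q_exp,
                   alphas=(0.2, 0.2, 0.2, 0.2, 0.2)):
    text = sent.lower()
    tokens = set(re.findall(r"[a-z0-9]+", text))
    w = (
        overlap(u, tokens),                                  # Wu: quotation words
        (sum(p.lower() in text for p in n_phrases) / len(n_phrases)
         if n_phrases else 0.0),                             # Wn: noun phrases
        1.0 if head.lower() in tokens else 0.0,              # Wh: head of 1st NP
        overlap(q0, tokens),                                 # Wc: original words
        overlap(q_exp, tokens),                              # We: expanded words
    )
    return sum(a * wi for a, wi in zip(alphas, w))

s = "The mythical Scottish town of Brigadoon appears for one day every 100 years."
print(round(sentence_score(s, [], ["mythical scottish town"], "town",
                           ["mythical", "Scottish", "town", "appears",
                            "one", "day", "100", "years"],
                           ["brigadoon"]), 3))  # 0.8
```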
Answer Extraction I
Fine-grained NE tagging for the top K sentences.
For each sentence, extract the string which matches the question class (a sketch follows).
E.g. "Who is Tom Cruise married to?" Q-class: HUM_BASIC
Top-ranked candidate sentence:
  Actor <HUM_PERSON Tom Cruise> and his wife <HUM_PERSON Nicole Kidman> accepted "substantial" libel damages on <TME_DATE Thursday> from a <LOC_COUNTRY British> newspaper that reported he was gay and that their marriage was a sham to cover it up.
Answer string: Nicole Kidman
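A sketch of the matching step over the inline NE markup; matching the question class HUM_BASIC against HUM_* tags via the main-class prefix is an assumption made for this illustration:

```python
import re

def extract_answer(tagged_sentence, q_class_prefix, question_words):
    # NE spans look like <TAG surface text>
    spans = re.findall(r"<([A-Z_]+) ([^>]+?)\s*>", tagged_sentence)
    for tag, text in spans:
        # Skip entities already mentioned in the question itself.
        if tag.startswith(q_class_prefix) and text not in question_words:
            return text
    return None

sent = ('Actor <HUM_PERSON Tom Cruise> and his wife '
        '<HUM_PERSON Nicole Kidman > accepted "substantial" libel damages '
        'on <TME_DATE Thursday> from a <LOC_COUNTRY British> newspaper.')
print(extract_answer(sent, "HUM", {"Tom Cruise"}))  # Nicole Kidman
```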
Answer Extraction II
For some questions, we cannot find any answer:
- reduce the # of expanded query terms and repeat the Document Retrieval, Candidate Sentence Retrieval and Answer Extraction
- the whole process lasts for at most N iterations (N = 5)
- if we still cannot find an exact answer, NIL is taken as the answer
This successive constraint relaxation increases recall step by step while preserving precision (sketched below).
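A sketch of the relaxation loop; the three stage functions are passed in as callables, with toy stubs standing in for the real pipeline stages described above:

```python
MAX_ITERATIONS = 5

def answer(q0, expanded, retrieve, rank, extract):
    terms = list(expanded)  # ranked expansion terms, weakest last (assumed)
    for _ in range(MAX_ITERATIONS):
        sentences = rank(retrieve(q0 + terms), q0, terms)
        result = extract(sentences)
        if result is not None:
            return result          # exact answer found
        if terms:
            terms.pop()            # relax: drop the weakest expansion term
    return "NIL"                   # no exact answer after all iterations

# Toy stubs: retrieval only succeeds once the over-strict term is dropped.
retrieve = lambda q: ["Ottawa is the capital of Canada."] if "brigadoon" not in q else []
rank = lambda docs, q0, t: docs
extract = lambda sents: "Ottawa" if sents else None
print(answer(["capital", "Canada"], ["brigadoon"],
             retrieve, rank, extract))  # Ottawa
```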
Evaluation in TREC 2002
Confidence-weighted score (uninterpolated average precision):

  score = (1/500) · Σ (i = 1..500) (# correct up to question i) / i

We answered 290 questions correctly; score: 0.61 (a sketch of the metric follows)
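A sketch of computing this score from the per-question judgments, ordered by the system's confidence:

```python
# Average of the running precision over the confidence-ordered judgments.
def confidence_weighted_score(correct_flags):
    total, running_correct = 0.0, 0
    for i, is_correct in enumerate(correct_flags, start=1):
        running_correct += is_correct
        total += running_correct / i
    return total / len(correct_flags)

# Toy run of 5 questions ordered by confidence (True = judged correct).
print(round(confidence_weighted_score([True, True, False, True, False]), 3))
# 0.803
```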
Result Analysis I
[Chart: number of questions with a correct answer (y-axis, 0-45) against the number of TREC runs answering each question correctly (x-axis, 1-51); two series: Total Num Q and Our Num of Q.]
Result Analysis II
Recognizing no-answer (NIL) questions:
- Precision: 41 / 170 = 0.241
- Recall: 41 / 46 = 0.891
Non-NIL answers:
- Precision: 249 / 330 = 0.755
- Recall: 249 / 444 = 0.561
Overall recall is low compared to precision because Boolean search is strict.
Result Analysis III
Conclusion
- Integration of both lexical knowledge and external resources
- Detailed question classification
- Use of fine-grained named entities for question answering
- Successive constraint relaxation
Future Work
- Refining our term correlation by considering a combination of local context, global context and lexical correlations
- Exploring the structured use of external knowledge using the semantic perceptron net
- Developing template-based answer selection
- Longer-term research plan: interactive QA, analysis and opinion questions
Thank You!