21/11/2002
The Integration of Lexical Knowledge and External Resources for QA
Hui YANG, Tat-Seng CHUA
{yangh,chuats}@comp.nus.edu.sg
PRIS, School of Computing
National University of Singapore
Presentation Outline
- Introduction
- PRIS QA System Design
- Result and Analysis
- Conclusion
- Future Work
Open Domain QA
Find answers to open-domain natural language questions by searching a large collection of documents.
- Question Processing: may involve question re-formulation; used to find the answer type
- Query Expansion: to overcome the concept mismatch between the query and the information base
- Search for Candidate Answers: documents, paragraphs, or sentences
- Disambiguation: ranking (or re-ranking) of answers; location of exact answers
Current Research Trends
- Web-based QA: exploits Web redundancy; probabilistic algorithms
- Linguistic-based QA: part-of-speech tagging, syntactic parsing, semantic relations, named entity extraction, dictionaries (WordNet, etc.)
System Overview
Question Classification → Question Parsing → Query Formulation → Document Retrieval → Candidate Sentence Retrieval → Answer Extraction
[System architecture diagram: a question Q passes through Question Analysis (Question Classification, Question Parsing), Query Formulation by External Knowledge (drawing on the Web and WordNet), Document Retrieval over the relevant TREC documents, Sentence Ranking over candidate sentences, and Answer Extraction, yielding answer A. Original and expanded content words feed the retrieval stages; when no answer is found, the number of expanded content words is reduced and retrieval repeats.]
Question Classification
Based on question focus and answer type.
7 main classes: HUM, LOC, TME, NUM, OBJ, DES, UNKNOWN
- E.g. "Which city is the capital of Canada?" (Q-class: LOC)
- E.g. "Which state is the capital of Canada in?" (Q-class: LOC)
54 sub-classes. E.g. under LOC (location), we have 14 sub-classes:
- LOC_PLANET: 1, LOC_CITY: 18, LOC_CONTINENT: 3, LOC_COUNTRY: 18, LOC_COUNTY: 3, LOC_STATE: 3, LOC_PROVINCE: 2
- LOC_TOWN: 2, LOC_RIVER: 3, LOC_LAKE: 2, LOC_MOUNTAIN: 1, LOC_OCEAN: 2, LOC_ISLAND: 3, LOC_BASIC: 3
A sketch of such a classifier follows.
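As an illustration of this kind of mapping, here is a minimal keyword-pattern classifier in Python. The rules below are hypothetical examples phrased in the deck's taxonomy, not the actual PRIS rule set, which covers 7 main classes and 54 sub-classes.

```python
import re

# Hypothetical keyword patterns mapping a question onto the class taxonomy
# above; the real PRIS classifier is far more detailed.
RULES = [
    (r"\b(which|what) city\b",    "LOC_CITY"),
    (r"\b(which|what) town\b",    "LOC_TOWN"),
    (r"\b(which|what) country\b", "LOC_COUNTRY"),
    (r"\bwho\b",                  "HUM_BASIC"),
    (r"\bwhen\b|\bwhat year\b",   "TME_DATE"),
    (r"\bhow (many|much)\b",      "NUM_BASIC"),
    (r"\bwhere\b",                "LOC_BASIC"),
]

def classify(question: str) -> str:
    q = question.lower()
    for pattern, q_class in RULES:
        if re.search(pattern, q):
            return q_class
    return "UNKNOWN"

print(classify("Which city is the capital of Canada?"))  # LOC_CITY
```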
Question Parsing
Content words: q(0)
- Nouns, adjectives, numbers, some verbs
- E.g. "What mythical Scottish town appears for one day every 100 years?" Q-class: LOC_TOWN
- q(0): (mythical, Scottish, town, appears, one, day, 100, years)
Base noun phrases: n
- n: ("mythical Scottish town")
Head of the 1st noun phrase: h
- h: (town)
Quotation words: u
- E.g. "What was the original name before 'The Star Spangled Banner'?"
- u: ("The Star Spangled Banner")
A simplified sketch of this extraction follows.
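A simplified sketch of the content-word and quotation extraction, using an ad hoc stopword filter in place of the system's POS tagger and NP chunker (the stopword list is an assumption for illustration):

```python
import re

# Ad hoc stopword list standing in for POS-based content-word selection.
STOPWORDS = {"what", "which", "who", "is", "the", "a", "an", "of",
             "for", "every", "was", "before", "in", "to", "on"}

def parse_question(question: str):
    # u: quoted strings, kept verbatim
    quotes = re.findall(r'"([^"]+)"', question)
    # q(0): remaining non-stopword tokens (content words)
    stripped = re.sub(r'"[^"]+"', " ", question)
    tokens = re.findall(r"[A-Za-z0-9]+", stripped)
    content = [t for t in tokens if t.lower() not in STOPWORDS]
    return {"q0": content, "u": quotes}

q = "What mythical Scottish town appears for one day every 100 years?"
print(parse_question(q))
# {'q0': ['mythical', 'Scottish', 'town', 'appears', 'one', 'day',
#         '100', 'years'], 'u': []}
```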
Query Formulation I
- Use original content words as query to search the Web (e.g. Google)
- Find new terms which have high correlation with the original query
- Use WordNet to find the synsets and glosses of original query terms
- Rank new query terms based on both Web and WordNet
- Form new Boolean query
Query Formulation II
Original query q(0) = (q1(0), q2(0), ..., qk(0))
Use the Web as a generalized resource:
- From q(0), retrieve the top N documents
- For each qi(0) ∈ q(0), extract nearby non-trivial words (within the same sentence, or n words away) to get wi
- Rank each wik ∈ wi by computing its probability of correlation with qi(0):

  Prob(wik) = # instances of (wik ∧ qi(0)) / # instances of (wik ∨ qi(0))

- Merge all wi to form Cq for q(0)
A sketch of this correlation measure follows.
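A sketch of the correlation measure, treating each retrieved sentence as a set of words; the toy sentences below are invented for illustration:

```python
# Score a candidate expansion term w against an original query term q by
# co-occurrence in the retrieved sentences (a Jaccard-style ratio).
def correlation(sentences, w, q):
    both = sum(1 for s in sentences if w in s and q in s)   # w AND q
    either = sum(1 for s in sentences if w in s or q in s)  # w OR q
    return both / either if either else 0.0

sentences = [
    {"brigadoon", "mythical", "scottish", "town"},
    {"scottish", "highlands", "travel"},
    {"brigadoon", "musical", "film"},
]
# "brigadoon" and "scottish" co-occur in 1 of the 3 sentences
# that contain either term.
print(correlation(sentences, "brigadoon", "scottish"))  # 0.333...
```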
Query Formulation III
Use WordNet as a generalized resource:
- For each qi(0) ∈ q(0), extract terms that are lexically related to qi(0) by locating them in its gloss Gi and its synset Si
- For q(0), we get Gq and Sq
- Re-rank each wik ∈ wi by considering lexical relations: for wik ∈ Cq, if wik ∈ Gi, its weight increases by α; if wik ∈ Si, its weight increases by β (0 < α < β < 1)
- Get q(1) = q(0) + {top m terms from Cq}
A sketch of this re-ranking follows.
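A sketch of the re-ranking step using NLTK's WordNet interface (requires the nltk package and its wordnet data); the boost values stand in for α and β and are illustrative, not the authors' tuned weights:

```python
from nltk.corpus import wordnet as wn  # needs nltk.download('wordnet')

ALPHA, BETA = 0.2, 0.4  # gloss and synset boosts; assumed 0 < ALPHA < BETA < 1

def rerank(candidates, query_term):
    """candidates: dict of term -> Web correlation score; returns boosted dict."""
    gloss_words, synset_words = set(), set()
    for syn in wn.synsets(query_term):
        gloss_words.update(syn.definition().lower().split())   # words in Gi
        synset_words.update(l.lower() for l in syn.lemma_names())  # words in Si
    boosted = {}
    for term, score in candidates.items():
        if term in gloss_words:
            score += ALPHA
        if term in synset_words:
            score += BETA
        boosted[term] = score
    return boosted

print(rerank({"village": 0.3, "brigadoon": 0.5}, "town"))
```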
Document Retrieval
1,033,461 documents from:
- AP newswire, 1998-2000
- New York Times newswire, 1998-2000
- Xinhua News Agency, 1996-2000
MG Tool: Boolean search to retrieve the top N documents (N = 50)
∀ tk ∈ q(1): Q = (t1 ∧ t2 ∧ ... ∧ tn), as sketched below
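A sketch of forming the conjunctive Boolean query Q from the expanded term set q(1); the `&` operator is assumed MG-style Boolean syntax here, not verified against the MG Tool:

```python
# Join all expanded query terms into one conjunctive Boolean query.
def boolean_query(terms):
    return "(" + " & ".join(terms) + ")"

print(boolean_query(["mythical", "scottish", "town", "brigadoon"]))
# (mythical & scottish & town & brigadoon)
```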
Candidate Sentence Retrieval
∀ sentence Sentj in the top N documents, match with:
- quotation words: Wuj = % of term overlap between u and Sentj
- noun phrases: Wnj = % of phrase overlap between n and Sentj
- head of first noun phrase: Whj = 1 if there is a match and 0 otherwise
- original content words: Wcj = % of term overlap between q(0) and Sentj
- expanded content words: Wej = % of term overlap between q(1-0) and Sentj, where q(1-0) = q(1) - q(0)
Final score: Sj = Σi αi · Wij, where Σi αi = 1 and Wij ∈ {Wuj, Wnj, Whj, Wcj, Wej} (a sketch follows)
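A sketch of the final score Sj = Σi αi · Wij; the uniform α values are illustrative, not the authors' settings:

```python
import re

def overlap(terms, tokens):
    terms = {t.lower() for t in terms}
    return len(terms & tokens) / len(terms) if terms else 0.0

def sentence_score(sent, u, n_phrases, head, q0, q_exp,
                   alphas=(0.2, 0.2, 0.2, 0.2, 0.2)):
    text = sent.lower()
    tokens = set(re.findall(r"[a-z0-9]+", text))
    w = (
        overlap(u, tokens),                                  # Wu: quotation words
        (sum(p.lower() in text for p in n_phrases) / len(n_phrases)
         if n_phrases else 0.0),                             # Wn: noun phrases
        1.0 if head.lower() in tokens else 0.0,              # Wh: head of 1st NP
        overlap(q0, tokens),                                 # Wc: original words
        overlap(q_exp, tokens),                              # We: expanded words
    )
    return sum(a * wi for a, wi in zip(alphas, w))

s = "The mythical Scottish town of Brigadoon appears for one day every 100 years."
print(round(sentence_score(s, [], ["mythical scottish town"], "town",
                           ["mythical", "Scottish", "town", "appears",
                            "one", "day", "100", "years"],
                           ["brigadoon"]), 3))  # 0.8
```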
Answer Extraction I
Fine-grained NE tagging for the top K sentences.
For each sentence, extract the string which matches the question class (a sketch follows).
E.g. "Who is Tom Cruise married to?" Q-class: HUM_BASIC
Top-ranked candidate sentence:
  Actor <HUM_PERSON Tom Cruise> and his wife <HUM_PERSON Nicole Kidman> accepted "substantial" libel damages on <TME_DATE Thursday> from a <LOC_COUNTRY British> newspaper that reported he was gay and that their marriage was a sham to cover it up.
Answer string: Nicole Kidman
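A sketch of the matching step over the inline NE markup; matching the question class HUM_BASIC against HUM_* tags via the main-class prefix is an assumption made for this illustration:

```python
import re

def extract_answer(tagged_sentence, q_class_prefix, question_words):
    # NE spans look like <TAG surface text>
    spans = re.findall(r"<([A-Z_]+) ([^>]+?)\s*>", tagged_sentence)
    for tag, text in spans:
        # Skip entities already mentioned in the question itself.
        if tag.startswith(q_class_prefix) and text not in question_words:
            return text
    return None

sent = ('Actor <HUM_PERSON Tom Cruise> and his wife '
        '<HUM_PERSON Nicole Kidman > accepted "substantial" libel damages '
        'on <TME_DATE Thursday> from a <LOC_COUNTRY British> newspaper.')
print(extract_answer(sent, "HUM", {"Tom Cruise"}))  # Nicole Kidman
```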
Answer Extraction II
For some questions, we cannot find any answer:
- reduce the # of expanded query terms and repeat the Document Retrieval, Candidate Sentence Retrieval and Answer Extraction
- the whole process lasts for at most N iterations (N = 5)
- if we still cannot find an exact answer, NIL is taken as the answer
This successive constraint relaxation increases recall step by step while preserving precision (sketched below).
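A sketch of the relaxation loop; the three stage functions are passed in as callables, with toy stubs standing in for the real pipeline stages described above:

```python
MAX_ITERATIONS = 5

def answer(q0, expanded, retrieve, rank, extract):
    terms = list(expanded)  # ranked expansion terms, weakest last (assumed)
    for _ in range(MAX_ITERATIONS):
        sentences = rank(retrieve(q0 + terms), q0, terms)
        result = extract(sentences)
        if result is not None:
            return result          # exact answer found
        if terms:
            terms.pop()            # relax: drop the weakest expansion term
    return "NIL"                   # no exact answer after all iterations

# Toy stubs: retrieval only succeeds once the over-strict term is dropped.
retrieve = lambda q: ["Ottawa is the capital of Canada."] if "brigadoon" not in q else []
rank = lambda docs, q0, t: docs
extract = lambda sents: "Ottawa" if sents else None
print(answer(["capital", "Canada"], ["brigadoon"],
             retrieve, rank, extract))  # Ottawa
```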
Evaluation in TREC 2002
Confidence-weighted score (uninterpolated average precision):

  score = (1/500) · Σ (i = 1..500) (# correct up to question i) / i

We answered 290 questions correctly; score: 0.61 (a sketch of the metric follows)
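A sketch of computing this score from the per-question judgments, ordered by the system's confidence:

```python
# Average of the running precision over the confidence-ordered judgments.
def confidence_weighted_score(correct_flags):
    total, running_correct = 0.0, 0
    for i, is_correct in enumerate(correct_flags, start=1):
        running_correct += is_correct
        total += running_correct / i
    return total / len(correct_flags)

# Toy run of 5 questions ordered by confidence (True = judged correct).
print(round(confidence_weighted_score([True, True, False, True, False]), 3))
# 0.803
```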
Result Analysis I
[Chart: number of questions with a correct answer (y-axis, 0-45) against the number of TREC runs answering each question correctly (x-axis, 1-51); two series: Total Num Q and Our Num of Q.]
Result Analysis II
Recognizing no-answer (NIL) questions:
- Precision: 41 / 170 = 0.241
- Recall: 41 / 46 = 0.891
Non-NIL answers:
- Precision: 249 / 330 = 0.755
- Recall: 249 / 444 = 0.561
Overall recall is low compared to precision because Boolean search is strict.
Result Analysis III
Conclusion
- Integration of both lexical knowledge and external resources
- Detailed question classification
- Use of fine-grained named entities for question answering
- Successive constraint relaxation
Future Work
- Refining our term correlation by considering a combination of local context, global context and lexical correlations
- Exploring the structured use of external knowledge using the semantic perceptron net
- Developing template-based answer selection
- Longer-term research plan: interactive QA, analysis and opinion questions
Thank You!