8/14/2019 QA IR Tutorial
UDA Ubiquitous Digital Agents
Architectures Machines And Devices for Efficient Ubiquitous Systems (AMADEUS)
AMADEUS TUTORIAL
Combining Question Answering and Information Retrieval
Suresh Manandhar and Thimal Jayasooriya
University of York
-
Search Engines
Question: How tall is Mt Everest?
IR could give the following answers:
Mt Everest was first climbed by Hillary
Mt Everest is part of the Himalayan range
Susan Armstrong, the 28 yr old rep from New York, climbed the 8800m high Mt Everest
plus a large number of irrelevant answers
Search engines are not so good at reasoning with syntax and semantics
-
Beyond Search Engines
Want shallow understanding of Natural Language to support a range of applications:
Document clustering
Topic based searching
Question based searching
Interfacing with DB backends
Linking multiple related documents
Intelligent searching to aid scientific discovery, document organisation, automatic extraction of knowledge from textual data
-
Part of Speech (POS) tagging
PoS tagging is the pre-step to syntactic analysis
Given "I went to the bank", output:
I-pronoun went-verb-past to-prep the-det bank-noun-sg
Notice "bank" can be verb/noun
"I" can be numeric or pronoun
And there can be unknown words:
I went to Pokhara
Task of PoS tagger is to assign the correct PoS tags
-
POS tagging
current taggers employ HMMs trained on large corpora
trigram-based p(t0 | t-2 t-1) taggers are the current state-of-the-art - accuracy > 96%
Q-A systems: lots of unknown words or known words used unusually, e.g. what i want is ... ?
low tagging accuracy on unknown words - best systems ~ 84% to 88%
far less tagged data on questions
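The trigram model on this slide conditions each tag on the two tags before it. The following is a minimal sketch of estimating p(t0 | t-2 t-1) by relative frequency, assuming a tiny hand-tagged corpus; the corpus, the tag names and the function name `trigram_probs` are illustrative assumptions, and a real HMM tagger would also need emission probabilities and smoothing.

```python
from collections import Counter

# Toy tagged corpus (hypothetical data, for illustration only).
TAGGED = [
    [("I", "pron"), ("went", "verb"), ("to", "prep"), ("the", "det"), ("bank", "noun")],
    [("I", "pron"), ("went", "verb"), ("to", "prep"), ("Pokhara", "noun")],
]

def trigram_probs(sentences):
    """Maximum-likelihood estimate of p(t0 | t-2, t-1) over tag sequences."""
    trigram_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        # Pad with two start symbols so the first real tag has a full context.
        tags = ["<s>", "<s>"] + [tag for _, tag in sentence]
        for i in range(2, len(tags)):
            context = (tags[i - 2], tags[i - 1])
            trigram_counts[context + (tags[i],)] += 1
            context_counts[context] += 1
    return {tri: n / context_counts[tri[:2]] for tri, n in trigram_counts.items()}

probs = trigram_probs(TAGGED)
```

With this toy data, the context (verb, prep) is followed once by det and once by noun, so each continuation gets probability 0.5. Unseen trigrams get no entry at all, which is one reason unknown words and unusual constructions hurt accuracy.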
-
Parsing and Grammar formalisms
constructing canonical logical forms from sentences
a good grammatical formalism will allow mapping:
John bought a toy / A toy was bought by John
to roughly the same semantic representation
exists(Y): toy(Y) & buy(John,Y)
common syntactic phenomena include:
relative clauses: The man that Mary liked went home.
co-ordinations: Bill and Mary got married.
question constructions: What book did John buy?
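To make the active/passive mapping concrete, here is a toy sketch that sends both example sentences to the same logical form. The two hard-coded regular expressions and the function name `logical_form` are assumptions for illustration only; a real grammar formalism would derive the form from a parse rather than from string patterns.

```python
import re

# Hypothetical patterns covering only the two sentences on the slide.
ACTIVE = re.compile(r"^(\w+) bought a (\w+)$")
PASSIVE = re.compile(r"^A (\w+) was bought by (\w+)$")

def logical_form(sentence):
    """Return a canonical logical form, or None if no pattern applies."""
    match = ACTIVE.match(sentence)
    if match:
        agent, obj = match.group(1), match.group(2)
    else:
        match = PASSIVE.match(sentence)
        if not match:
            return None
        # In the passive, the object comes first and the agent last.
        obj, agent = match.group(1), match.group(2)
    return f"exists(Y): {obj}(Y) & buy({agent},Y)"

active_form = logical_form("John bought a toy")
passive_form = logical_form("A toy was bought by John")
```

Both calls return the representation shown on the slide, which is the property the matching step later relies on.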
-
Parsing: Pure CFGs not sufficient
most automatically extracted grammars employ pure CFGs
need information on moved phrases to generate semantics
dependency information is crucial
most machine learnt grammars focus on raw crossing bracket rates
-
Parsing Issues: PP attachment ambiguity
prepositional phrase attachment ambiguity
e.g. I drank scotch on ice
better to leave PPs unattached rather than guessing wrong
combine both position-based and meaning-based matching
hybrid representations that combine logical form, syntactic structure and string/position based representations needed
-
Recollect: Search Engines
Question: How tall is Mt Everest?
IR could give the following answers:
Mt Everest was first climbed by Hillary
Mt Everest is part of the Himalayan range
Susan Armstrong, the 28 yr old rep from New York, climbed the 8800m high Mt Everest
plus a large number of irrelevant answers
Search engines are not so good at reasoning with syntax and semantics
-
Matching: Getting the right answer
Matching answers with questions
Susan Armstrong, the 28 yr old rep from New York, climbed the 8800m high Mt Everest
Reasoning using logical form and lexical relations:
Meaning representation: X = Mt_Everest & tall(X, ?Y)
"How tall" is asking for a numeric measure
"tall" is related to height/high
the 8800m high Mt Everest:
high(X,Y) & Y=8800m & X = Mt_Everest
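The matching step above can be sketched as a lookup over extracted facts in which the question predicate is expanded through lexical relations. This is a minimal sketch under stated assumptions: the tuple layout `(predicate, entity, value)`, the `RELATED` table and the function name `answer` are hypothetical, not the system described in the tutorial.

```python
# Lexical relations used to relax predicate matching (assumed, hand-written).
RELATED = {"tall": {"tall", "high", "height"}}

def answer(question, facts):
    """Fill the unknown ?Y of pred(entity, ?Y) from facts with a related predicate."""
    predicate, entity = question
    accepted = RELATED.get(predicate, {predicate})
    for fact_predicate, fact_entity, value in facts:
        if fact_entity == entity and fact_predicate in accepted:
            return value
    return None

# Facts as they might be extracted from the candidate answer sentences.
FACTS = [
    ("part_of", "Mt_Everest", "Himalayan_range"),
    ("high", "Mt_Everest", "8800m"),
]

result = answer(("tall", "Mt_Everest"), FACTS)
```

Here the question tall(Mt_Everest, ?Y) matches the fact high(Mt_Everest, 8800m) only because "tall" is listed as related to "high"; without that relation the lookup returns nothing.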
-
Matching: Lexical Relations
Lexical relations are semantic relations between words:
Synonym: (human person)
Antonym: (tall short)
Hyponym: (BMW car)
Meronym: (door house)
Entailment: (fire smoke)
.. plus many more
Matching algorithm computes the semantic distance between the Question and the Answer.
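One simple way to turn lexical relations into a distance is to count the fewest relation links between two words. The sketch below assumes that definition (unit cost per link, relation type ignored), which is only one of many possibilities; the extra car/vehicle edge and the names `build_graph` and `semantic_distance` are assumptions added for illustration.

```python
from collections import deque

RELATIONS = [
    ("human", "person", "synonym"),
    ("tall", "short", "antonym"),
    ("BMW", "car", "hyponym"),
    ("door", "house", "meronym"),
    ("fire", "smoke", "entailment"),
    ("car", "vehicle", "hyponym"),  # not on the slide; assumed for illustration
]

def build_graph(relations):
    """Undirected adjacency sets; the relation label is ignored here."""
    graph = {}
    for left, right, _ in relations:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph

def semantic_distance(graph, source, target):
    """Fewest relation links between two words, or None if unconnected."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        word, dist = queue.popleft()
        if word == target:
            return dist
        for neighbour in graph[word]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

GRAPH = build_graph(RELATIONS)
```

For example, `semantic_distance(GRAPH, "BMW", "vehicle")` follows BMW to car to vehicle and returns 2, while words with no connecting path return `None`.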
-
Matching and reasoning
reasoning crucial to Q-A
WordNet provides:
hypernym, synonym, antonym, meronym, etc.
(but) common relations required for Q-A tasks missing:
noun-adjective (benefit-beneficial)
verb-noun (punish-punishment)
entailment (penalty-punishment)
telic (hammer-break)
limited current research on learning of semantic relations
-
Search engines
Current state of the art
Google uses backlinks to determine the most relevant pages
Most of the other search engines use keyword scanning techniques
Online directories, such as DMOZ, use human editors to sort and rank content
-
Beyond Search Engines
Essential ingredients
A better syntactic understanding of document contents
More efficient means of grouping similar or related elements together
A better understanding of relevance to the user
A better query interface?
-
The ideal situation
Error free disambiguation of natural language in documents
Categorization of documents by subject and intent of the author, rather than by scanned keywords
The opportunity to clarify the information needs of individual users, to more closely match what they want
-
Possible dimensions for queries
Who is the First Lord of the Treasury of the United Kingdom?
First Lords of the Treasury
Prime Ministers
Head of State in the UK
-
Are Document Dimensions an answer ?
In data warehousing terminology, a dimension is a structure that categorizes data in order to enable end users to answer questions
Charles Bachman urged programmers to think in terms of multi-dimensional space as far back as 1973
Moth experimented with document metadata in dimensional space (2001 and 2003)
Roelleke used the accessibility dimension to determine relevance within a document
-
How are dimensions created ?
Cleanse and tokenize source data
Shallow parse the source data to resolve some syntactic ambiguity
Extract a series of unique terms, words or phrases
Determine the similarity between individual terms
Organize similar terms into dimensions, groups of semantically related elements
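The first and third steps above (cleansing, tokenizing and extracting unique terms) can be sketched with the standard library alone. The function name `extract_terms` and the letters-only tokenization rule are assumptions; shallow parsing and similarity scoring are deliberately left out of this sketch.

```python
import re

def extract_terms(text):
    """Lower-case the text, keep alphabetic tokens, return unique terms in first-seen order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(tokens))

terms = extract_terms("The duck saw the Duck.")
```

The resulting term list is what a similarity measure would then compare pairwise before terms are organized into dimensions.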
-
Analysing source documents
The source data was the sample newswire articles from TREC-11, 3 gigabytes of XML formatted data, consisting of around 20,000 articles
Stripping XML formatting
Detecting sentence boundaries in articles
POS tagging individual sentences
Named Entity and Coreference annotation
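The first two preprocessing steps, stripping XML and detecting sentence boundaries, might look like the sketch below. The `<DOC>`/`<TEXT>` element names, the sample text and the punctuation-plus-capital splitting rule are assumptions for illustration; the actual TREC markup and the boundary detector used are not given in the slides.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical article fragment; not real TREC-11 data.
SAMPLE = (
    "<DOC><TEXT>Kenneth Joseph Lenihan died May 25. "
    "He lived in Manhattan.</TEXT></DOC>"
)

def strip_xml(xml_string):
    """Drop all markup and normalise whitespace in the remaining text."""
    root = ET.fromstring(xml_string)
    return " ".join("".join(root.itertext()).split())

def split_sentences(text):
    """Naive boundary rule: split after . ! or ? when a capital letter follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s]

sentences = split_sentences(strip_xml(SAMPLE))
```

A rule this naive breaks on abbreviations such as "Mr. Smith", so it should be read only as a placeholder for a proper sentence boundary detector.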
-
NE annotation
Kenneth Joseph Lenihan, a New York research sociologist who helped refine the scientific methods used in criminology, died May 25 at his home in Manhattan.
Kenneth Joseph Lenihan: Named Entity (Person name)
New York: Named Entity (Location name)
May 25: Named Entity (Temporal entity)
Manhattan: Named Entity (Location name)
-
Semantic distance
Semantic distance uses the concept of relatedness, or the semantic similarity between two lexical concepts
Grouping synonyms together seems intuitive, e.g.: humans, people, beings
But surprisingly, other lexical concepts such as meronyms, hyponyms, hypernyms, troponyms and even antonyms can also be semantically close.
Different semantic distance algorithms for WordNet quantify relatedness in different ways (Budanitsky 2001)
-
Semantic distance continued
Dimensions are found by setting an inclusion distance, an experimentally derived figure for semantic distance
The inclusion distance differs between algorithms, and can sometimes even differ depending on the dataset
All terms which are within the specified inclusion distance are grouped in the same dimension
Terms within a dimension serve as a starting point for searching related concepts
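Grouping by inclusion distance can be sketched as follows. This is one possible reading, assuming a term joins a dimension when it is within the inclusion distance of at least one existing member; the distance table, the default distance of 10 and the names `group_dimensions` and `toy_distance` are hypothetical.

```python
def group_dimensions(terms, distance, inclusion_distance):
    """Group terms so that each is within inclusion_distance of some member of its group."""
    dimensions = []
    for term in terms:
        # Existing dimensions that contain at least one close-enough member.
        close = [
            dim for dim in dimensions
            if any(distance(term, other) <= inclusion_distance for other in dim)
        ]
        merged = {term}
        for dim in close:
            merged |= dim
            dimensions.remove(dim)
        dimensions.append(merged)
    return dimensions

# Hypothetical pairwise distances; unlisted pairs are treated as far apart.
DIST = {
    frozenset(("humans", "people")): 1,
    frozenset(("people", "beings")): 1,
    frozenset(("humans", "beings")): 2,
}

def toy_distance(a, b):
    return DIST.get(frozenset((a, b)), 10)

dims = group_dimensions(["humans", "people", "beings", "duck"], toy_distance, 1)
```

With an inclusion distance of 1, "humans", "people" and "beings" end up in one dimension and "duck" in its own. Raising the inclusion distance merges more terms, which is consistent with the search space growth noted later in the tutorial.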
-
Issues with dimensions
Search space explosion (at least thrice the number of documents are returned)
Stored semantic knowledge is not sufficiently granular: duck has 85 different entries in Roget's thesaurus; 47 verb definitions, 21 noun definitions and 18 uses as an adjective.
However, the term database stores only the part of speech tag. Thus, all 47 uses of duck as a verb are clumped together
WordNet is not sufficiently rich in lexical relations, nor sufficiently inclusive of modern language idioms
-
The way forward
Adding Natural language processing techniques to search is the answer
Processing capabilities allow NLP techniques to be included without significant degradation of speed
Richer lexicons and language resources are being developed
People are continually asking harder questions of available information resources; keyword searches no longer satisfy end users!
IBM's WebFountain and the MOMINIS research project are two of several research initiatives to bring focused crawling and natural language processing techniques to search