qa ir tutorial

8/14/2019 QA IR Tutorial


UDA Ubiquitous Digital Agents

Architectures Machines And Devices for Efficient Ubiquitous Systems (AMADEUS)

AMADEUS TUTORIAL

Combining Question Answering and Information Retrieval

Suresh Manandhar and Thimal Jayasooriya

University of York





Search Engines

Question: How tall is Mt Everest?

IR could give the following answers:

Mt Everest was first climbed by Hillary

Mt Everest is part of the Himalayan range

Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest

plus a large number of irrelevant answers

Search engines are not so good at reasoning with syntax and semantics
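
To make the point concrete, here is a minimal sketch of keyword-overlap scoring applied to the question and the three candidate sentences above. The stopword list and the scoring function are illustrative assumptions, not the method of any particular search engine.

```python
import re

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"how", "is", "the", "a", "of", "was", "by", "from"}

def keywords(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def keyword_score(question, candidate):
    """Count the keywords shared by the question and a candidate sentence."""
    return len(keywords(question) & keywords(candidate))

question = "How tall is Mt Everest?"
candidates = [
    "Mt Everest was first climbed by Hillary",
    "Mt Everest is part of the Himalayan range",
    "Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest",
]
scores = [keyword_score(question, c) for c in candidates]
```

All three candidates share only "mt" and "everest" with the question, so they receive the same score. Keyword overlap alone cannot tell that only the third sentence contains the height.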




Beyond Search Engines

Want a shallow understanding of natural language to support a range of applications:

Document clustering

Topic based searching

Question based searching

Interfacing with DB backends

Linking multiple related documents

Intelligent searching to aid scientific discovery, document organisation, and automatic extraction of knowledge from textual data




Part of Speech (POS) tagging

POS tagging is a preliminary step before syntactic analysis

Given "I went to the bank", output:

I-pronoun went-verb-past to-prep the-det bank-noun-sg

Notice that "bank" can be a verb or a noun

"I" can be a numeral or a pronoun

And there can be unknown words:

I went to Pokhara

The task of a POS tagger is to assign the correct POS tags
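
The ambiguity and unknown-word problems can be seen with a toy lexicon lookup. This is only a sketch: the lexicon entries and tag names are hand-written assumptions, and it lists candidate tags rather than choosing among them, which is the part a real tagger has to solve.

```python
# Toy lexicon: each known word maps to the set of tags it can take.
# Entries and tag names are illustrative assumptions.
LEXICON = {
    "i": {"pronoun", "numeral"},
    "went": {"verb-past"},
    "to": {"prep"},
    "the": {"det"},
    "bank": {"noun-sg", "verb"},
}

def candidate_tags(sentence):
    """Return (word, sorted candidate tags) pairs; unknown words get ['unknown']."""
    return [(w, sorted(LEXICON.get(w.lower(), {"unknown"}))) for w in sentence.split()]
```

For "I went to the bank" both "I" and "bank" come back with two candidate tags, and for "I went to Pokhara" the last word comes back as unknown.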




POS tagging

current taggers employ HMMs trained on large corpora

trigram-based p(t0 | t-2 t-1) taggers are the current state of the art, with accuracy > 96%

Q-A systems: lots of unknown words, or known words used unusually, e.g. what i want is ... ?

low tagging accuracy on unknown words: best systems ~84% to 88%

far less tagged data on questions
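
As a rough illustration of the trigram model, the sketch below estimates the transition probabilities p(t0 | t-2 t-1) by relative frequency from tag sequences. The padding symbol `<s>` and the function name are assumptions. A full HMM tagger would also need word emission probabilities, smoothing, and a Viterbi search, none of which are shown here.

```python
from collections import Counter

def trigram_probs(tagged_sentences):
    """Estimate p(t0 | t-2, t-1) by relative frequency from lists of tags."""
    tri, bi = Counter(), Counter()
    for tags in tagged_sentences:
        # Pad with two start symbols so the first tags also have a context.
        padded = ["<s>", "<s>"] + list(tags)
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            tri[context + (padded[i],)] += 1
            bi[context] += 1
    # Unsmoothed maximum-likelihood estimate: count(trigram) / count(context).
    return {t: c / bi[t[:2]] for t, c in tri.items()}
```

With so little tagged question data, many trigram contexts would be unseen, which is one reason accuracy drops on questions.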




Parsing and Grammar formalisms

constructing canonical logical forms from sentences

a good grammatical formalism will allow mapping:

John bought a toy / A toy was bought by John

to roughly the same semantic representation

exists(Y): toy(Y) & buy(John,Y)
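
The active/passive mapping above can be mimicked with two hand-written patterns. This is a deliberately crude sketch: the regular expressions and the one-entry lemma table are assumptions that stand in for a real grammar, and they cover only sentences shaped like the two examples.

```python
import re

# Hypothetical lemma table and surface patterns; a real system would use a parser.
LEMMAS = {"bought": "buy"}
ACTIVE = re.compile(r"^(\w+) (\w+) an? (\w+)$", re.I)
PASSIVE = re.compile(r"^an? (\w+) was (\w+) by (\w+)$", re.I)

def logical_form(sentence):
    """Map a simple active or passive sentence to one canonical logical form."""
    s = sentence.strip().rstrip(".")
    m = PASSIVE.match(s)
    if m:
        obj, verb, subj = m.groups()
    else:
        m = ACTIVE.match(s)
        if not m:
            raise ValueError(f"unsupported sentence shape: {sentence!r}")
        subj, verb, obj = m.groups()
    lemma = LEMMAS.get(verb.lower(), verb.lower())
    return f"exists(Y): {obj.lower()}(Y) & {lemma}({subj},Y)"
```

Both "John bought a toy" and "A toy was bought by John" yield `exists(Y): toy(Y) & buy(John,Y)`.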

common syntactic phenomena include:

relative clauses: The man that Mary liked went home.

co-ordinations: Bill and Mary got married.

question constructions: What book did John buy?




Parsing: Pure CFGs not sufficient

most automatically extracted grammars employ pure CFGs

need information on moved phrases to generate semantics

dependency information is crucial

most machine-learnt grammars focus on raw crossing bracket rates




Parsing Issues: PP attachment ambiguity

prepositional phrase attachment ambiguity

e.g. I drank scotch on ice

better to leave PPs unattached rather than guessing wrong

combine both position-based and meaning-based matching

hybrid representations that combine logical form, syntactic structure and string/position based representations needed




Recollect: Search Engines

Question: How tall is Mt Everest?

IR could give the following answers:

Mt Everest was first climbed by Hillary

Mt Everest is part of the Himalayan range

Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest

plus a large number of irrelevant answers

Search engines are not so good at reasoning with syntax and semantics




Matching: Getting the right answer

Matching answers with questions

Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest

Reasoning using logical form and lexical relations:

Meaning representation: X = Mt_Everest & tall(X, ?Y)

"How tall" asks for a numeric measure

"tall" is related to height/high

the 8800m high Mt Everest:

high(X,Y) & Y=8800m & X = Mt_Everest
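
The matching step above can be sketched as a lookup that bridges the question predicate `tall` to the answer predicate `high` through a lexical relation. The `RELATED` table, the tuple encoding of facts, and the function name are assumptions made for illustration.

```python
# Lexical relations used to bridge predicates; hand-written for this example.
RELATED = {"tall": {"tall", "high", "height"}}

def answer_question(question, facts):
    """question: (predicate, entity); facts: list of (predicate, entity, value).

    Return the value of the first fact about the same entity whose predicate
    is lexically related to the question predicate, or None if there is none.
    """
    q_pred, q_entity = question
    for pred, entity, value in facts:
        if entity == q_entity and pred in RELATED.get(q_pred, {q_pred}):
            return value
    return None

facts = [
    ("part_of", "Mt_Everest", "Himalayan_range"),
    ("high", "Mt_Everest", "8800m"),
]
```

Here `answer_question(("tall", "Mt_Everest"), facts)` binds ?Y to `8800m` even though the word "tall" never appears in the answer sentence.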




Matching Lexical Relations

Lexical relations are semantic relations between words:

Synonym: (human person)

Antonym: (tall short)

Hyponym: (BMW car)

Meronym: (door house)

Entailment: (fire smoke)

... plus many more

The matching algorithm computes the semantic distance between the question and the answer.
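
One simple way to hold such relations is a table keyed by word pairs. The sketch below stores the pairs listed above and looks them up in either order; the helper name and the returned `(relation, forward)` shape are assumptions.

```python
# Relation pairs taken from the list above; stored in the order given there.
RELATIONS = {
    ("human", "person"): "synonym",
    ("tall", "short"): "antonym",
    ("BMW", "car"): "hyponym",
    ("door", "house"): "meronym",
    ("fire", "smoke"): "entailment",
}

def relation_between(a, b):
    """Return (relation, forward) for a word pair, or None if no relation is stored.

    forward is True when (a, b) is in the stored order and False when the
    pair was found reversed, which matters for directional relations such
    as hyponym or meronym.
    """
    if (a, b) in RELATIONS:
        return RELATIONS[(a, b)], True
    if (b, a) in RELATIONS:
        return RELATIONS[(b, a)], False
    return None
```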




Matching and reasoning

reasoning crucial to Q-A

WordNet provides:

hypernym, synonym, antonym, meronym, etc.

but common relations required for Q-A tasks are missing:

noun-adjective (benefit-beneficial)

verb-noun (punish-punishment)

entailment (penalty-punishment)

telic (hammer-break)

limited current research on learning of semantic relations




Search engines

Current state of the art

Google uses backlinks to determine the most relevant pages

Most of the other search engines use keyword scanning techniques

Online directories, such as DMOZ, use human editors to sort and rank content
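
As a simplified illustration of backlink-based ranking, the sketch below orders pages by how many distinct pages link to them. It is not Google's actual algorithm, which weights links rather than merely counting them; the function name and the input format are assumptions.

```python
from collections import Counter

def rank_by_backlinks(links):
    """links: iterable of (source_page, target_page) pairs.

    Return pages sorted by the number of distinct pages linking to them,
    most-linked first. Self-links and duplicate links are ignored; ties
    are broken alphabetically.
    """
    inbound = Counter(target for source, target in set(links) if source != target)
    return [page for page, _ in sorted(inbound.items(), key=lambda kv: (-kv[1], kv[0]))]
```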




Beyond Search Engines

Essential ingredients

A better syntactic understanding of document contents

More efficient means of grouping similar or related elements together

A better understanding of relevance to the user

A better query interface?




The ideal situation

Error-free disambiguation of natural language in documents

Categorization of documents by subject and intent of the author, rather than by scanned keywords

The opportunity to clarify the information needs of individual users, to more closely match what they want




Possible dimensions for queries

Who is the First Lord of the Treasury of the United Kingdom?

First Lords of the Treasury

Prime Ministers

Head of State in the UK




Are Document Dimensions an answer?

In data warehousing terminology, a dimension is a structure that categorizes data in order to enable end users to answer questions

Charles Bachman urged programmers to think in terms of multi-dimensional space as far back as 1973

Moth experimented with document metadata in dimensional space (2001 and 2003)

Roelleke used the accessibility dimension to determine relevance within a document




How are dimensions created?

Cleanse and tokenize source data

Shallow parse the source data to resolve some syntactic ambiguity

Extract a series of unique terms, words or phrases

Determine the similarity between individual terms

Organize similar terms into dimensions, groups of semantically related elements
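
The first and third steps above (cleansing, tokenizing, and extracting unique terms) can be sketched in a few lines. The tokenization rule, lowercasing and keeping runs of letters, is an assumption. Shallow parsing and phrase extraction are not covered here.

```python
import re
from collections import Counter

def extract_terms(text):
    """Lowercase the text, tokenize on runs of letters, and count each unique term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)
```

The keys of the returned counter are the unique terms. The counts are kept in case rare terms need to be filtered out before similarity is computed.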




Analysing source documents

The source data was the sample newswire articles from TREC-11: 3 gigabytes of XML-formatted data, consisting of around 20,000 articles

Stripping XML formatting

Detecting sentence boundaries in articles

POS tagging individual sentences

Named Entity and Coreference annotation
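
The first two preprocessing steps can be sketched with the Python standard library. The tag names in the usage note are placeholders rather than the actual TREC markup, and the sentence splitter is a naive rule that will break on abbreviations ending in a period.

```python
import re
import xml.etree.ElementTree as ET

def strip_xml(xml_string):
    """Parse an XML document and return its text content with tags removed."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text fragment; split/join normalizes whitespace.
    return " ".join(" ".join(root.itertext()).split())

def split_sentences(text):
    """Naive boundary detection: split after ., ! or ? when a capital letter follows."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
```

For example, `split_sentences(strip_xml("<DOC><TEXT>First sentence. Second sentence!</TEXT></DOC>"))` gives two sentences, which could then be passed to a POS tagger one at a time.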




NE annotation

Kenneth Joseph Lenihan, a New York research sociologist who helped refine the scientific methods used in criminology, died May 25 at his home in Manhattan.

Kenneth Joseph Lenihan: Named Entity (Person name)

New York: Named Entity (Location name)

May 25: Named Entity (Temporal entity)

Manhattan: Named Entity (Location name)
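
A gazetteer-and-pattern annotator is enough to reproduce the annotations on this one sentence. This is only a sketch: the name lists are hand-made for the example, and the tutorial does not say which annotator was actually used.

```python
import re

# Tiny hand-made gazetteers; all entries are illustrative assumptions.
PERSONS = ["Kenneth Joseph Lenihan"]
LOCATIONS = ["New York", "Manhattan"]
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# A month name followed by a one- or two-digit day, e.g. "May 25".
DATE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}\b")

def annotate(sentence):
    """Return (start_offset, text, label) triples sorted by position."""
    found = []
    for label, names in (("Person name", PERSONS), ("Location name", LOCATIONS)):
        for name in names:
            for m in re.finditer(re.escape(name), sentence):
                found.append((m.start(), m.group(), label))
    for m in DATE.finditer(sentence):
        found.append((m.start(), m.group(), "Temporal entity"))
    return sorted(found)
```

Run on the sentence above, it returns the person name, "New York", "May 25" and "Manhattan" in that order.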




Semantic distance

Semantic distance uses the concept of relatedness, or the semantic similarity between two lexical concepts

Grouping synonyms together seems intuitive, e.g. humans, people, beings

But surprisingly, other lexical concepts such as meronyms, hyponyms, hypernyms, troponyms and even antonyms can also be semantically close.

Different semantic distance algorithms for WordNet quantify relatedness in different ways (Budanitsky 2001)
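
One of the simplest ways to quantify relatedness is to count edges on the shortest path between two concepts. The sketch below does this over a tiny hand-made graph. The graph is not real WordNet data, and plain edge counting is only one of several possible measures.

```python
from collections import deque

# Toy undirected graph of lexical links (hand-made, not real WordNet data).
EDGES = [("human", "person"), ("person", "being"), ("person", "adult"),
         ("car", "BMW"), ("car", "vehicle")]

def build_graph(edges):
    """Build an adjacency map in which every edge is usable in both directions."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def path_distance(graph, start, goal):
    """Number of edges on the shortest path, or None if the words are unconnected."""
    if start not in graph or goal not in graph:
        return None
    seen, queue = {start}, deque([(start, 0)])
    while queue:  # breadth-first search guarantees the shortest path
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None
```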




Semantic distance continued

Dimensions are found by setting an inclusion distance, an experimentally derived figure for semantic distance

The inclusion distance differs between algorithms, and can sometimes even differ depending on the dataset

All terms which are within the specified inclusion distance are grouped in the same dimension

Terms within a dimension serve as a starting point for searching related concepts
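
The grouping rule can be sketched as follows, assuming single-link grouping: a term joins a dimension if it is within the inclusion distance of any term already in it. That reading is an assumption, since the slides do not specify it. The distance table is invented for the example.

```python
def group_into_dimensions(terms, distance, inclusion_distance):
    """Group terms so that any two terms within the inclusion distance
    end up, possibly through intermediate terms, in the same dimension."""
    dimensions = []
    for term in terms:
        # Existing dimensions that contain at least one close-enough term.
        close = [d for d in dimensions
                 if any(distance(term, other) <= inclusion_distance for other in d)]
        merged = {term}.union(*close)
        dimensions = [d for d in dimensions if d not in close] + [merged]
    return dimensions

# Hypothetical pairwise distances; unlisted pairs are treated as infinitely far apart.
TOY_DISTANCE = {
    frozenset(("human", "person")): 1,
    frozenset(("person", "being")): 1,
    frozenset(("human", "being")): 2,
    frozenset(("car", "BMW")): 1,
}

def toy_distance(a, b):
    """Look up the toy distance between two terms (0 for identical terms)."""
    return 0 if a == b else TOY_DISTANCE.get(frozenset((a, b)), float("inf"))
```

With an inclusion distance of 1, the terms human, person and being fall into one dimension, and car and BMW into another.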




Issues with dimensions

Search space explosion (at least thrice the number of documents are returned)

Stored semantic knowledge is not sufficiently granular: duck has 85 different entries in Roget's thesaurus; 47 verb definitions, 21 noun definitions and 18 uses as an adjective.

However, the term database stores only the part-of-speech tag. Thus, all 47 uses of duck as a verb are clumped together
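
The granularity problem can be shown by indexing the same rows at two levels of detail. The rows and sense labels below are invented for illustration and are not taken from Roget's thesaurus or from the actual term database.

```python
# Hypothetical term-database rows: (term, POS tag, sense label).
ROWS = [
    ("duck", "verb", "lower the head quickly"),
    ("duck", "verb", "avoid a duty"),
    ("duck", "noun", "water bird"),
]

def index_by(rows, key_fields):
    """Group rows by the chosen field positions (0=term, 1=POS, 2=sense)."""
    index = {}
    for row in rows:
        key = tuple(row[i] for i in key_fields)
        index.setdefault(key, []).append(row)
    return index

coarse = index_by(ROWS, (0, 1))     # what a POS-only database can distinguish
fine = index_by(ROWS, (0, 1, 2))    # one bucket per sense
```

In the coarse index both verb senses share the key `("duck", "verb")`, so a search that starts from one sense also pulls in the other.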

WordNet is not sufficiently rich in lexical relations, nor sufficiently inclusive of modern language idioms



The way forward

Adding natural language processing techniques to search is the answer

Processing capabilities allow NLP techniques to be included without significant degradation of speed

Richer lexicons and language resources are being developed

People are continually asking harder questions of available information resources; keyword searches no longer satisfy end users!

IBM's WebFountain and the MOMINIS research project are two of several research initiatives to bring focused crawling and natural language processing techniques to search

