qa ir tutorial

Upload: nasim09

Post on 30-May-2018


  • 8/14/2019 QA IR Tutorial


    UDA Ubiquitous Digital Agents

    AMADEUS: Architectures Machines And Devices for Efficient Ubiquitous Systems

    AMADEUS TUTORIAL

    Combining Question Answering and Information Retrieval

    Suresh Manandhar and Thimal Jayasooriya

    University of York


    Search Engines

    Question: How tall is Mt Everest?

    IR could give the following answers:

    Mt Everest was first climbed by Hillary

    Mt Everest is part of the Himalayan range

    Susan Armstrong, the 28 yr old rep from New York, climbed the 8800m high Mt Everest

    plus a large number of irrelevant answers

    Search engines are not so good at reasoning with syntax and semantics


    Beyond Search Engines

    Want shallow understanding of Natural Language to support a range of applications:

    Document clustering

    Topic based searching

    Question based searching

    Interfacing with DB backends

    Linking multiple related documents

    Intelligent searching to aid scientific discovery, document organisation, automatic extraction of knowledge from textual data


    Part of Speech (POS) tagging

    PoS tagging is the pre-step to syntactic analysis

    Given "I went to the bank", output:

    I-pronoun went-verb-past to-prep the-det bank-noun-sg

    Notice "bank" can be verb/noun

    "I" can be numeric or pronoun

    And there can be unknown words:

    I went to Pokhara

    Task of PoS tagger is to assign the correct PoS tags


    POS tagging

    current taggers employ HMMs trained on large corpora

    trigram-based p(t_0 | t_{-2}, t_{-1}) taggers are the current state of the art: accuracy > 96%

    Q-A systems: lots of unknown words, or known words used unusually, e.g. what i want is ... ?

    low tagging accuracy on unknown words: best systems ~84% to 88%

    far less tagged data on questions
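
    The trigram HMM tagging objective referred to on this slide can be written out as below. This is the standard textbook formulation rather than anything stated in the tutorial itself; the symbols w_i (the i-th word), t_i (its tag) and n (sentence length) are notation assumed here, and tags before the start of the sentence are taken to be padding symbols.

```latex
\hat{t}_{1:n} \;=\; \arg\max_{t_1,\dots,t_n} \; \prod_{i=1}^{n} p(w_i \mid t_i)\; p(t_i \mid t_{i-2},\, t_{i-1})
```

    The second factor is the trigram transition probability named on the slide; the first factor is the emission probability, which is where unknown words hurt accuracy, since p(w_i | t_i) cannot be estimated from training counts for a word that was never seen.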


    Parsing and Grammar formalisms

    constructing canonical logical forms from sentences

    a good grammatical formalism will allow mapping:

    John bought a toy / A toy was bought by John

    to roughly the same semantic representation

    exists(Y): toy(Y) & buy(John,Y)

    common syntactic phenomena include:

    relative clauses: The man that Mary liked went home.

    co-ordinations: Bill and Mary got married.

    question constructions: What book did John buy?


    Parsing: Pure CFGs not sufficient

    most automatically extracted grammars employ pure CFGs

    need information on moved phrases to generate semantics

    dependency information is crucial

    most machine learnt grammars focus on raw crossing bracket rates


    Parsing Issues: PP attachment ambiguity

    prepositional phrase attachment ambiguity

    e.g. I drank scotch on ice

    better to leave PPs unattached rather than guessing wrong

    combine both position-based and meaning-based matching

    hybrid representations that combine logical form, syntactic structure and string/position based representations needed


    Recollect: Search Engines

    Question: How tall is Mt Everest?

    IR could give the following answers:

    Mt Everest was first climbed by Hillary

    Mt Everest is part of the Himalayan range

    Susan Armstrong, the 28 yr old rep from New York, climbed the 8800m high Mt Everest

    plus a large number of irrelevant answers

    Search engines are not so good at reasoning with syntax and semantics


    Matching: Getting the right answer

    Matching answers with questions

    Susan Armstrong, the 28 yr old rep from New York, climbed the 8800m high Mt Everest

    Reasoning using logical form and lexical relations:

    Meaning representation: X = Mt_Everest & tall(X, ?Y)

    "How tall" is asking for a numeric measure

    "tall" is related to height/high

    the 8800m high Mt Everest:

    high(X,Y) & Y=8800m & X = Mt_Everest
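
    The matching step on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: logical forms are reduced to (predicate, entity, value) triples, and the RELATED table is a hand-written stand-in for a real lexical resource. The names RELATED, related and match are hypothetical and do not come from the tutorial.

```python
# Toy lexical relation table: hypothetical, hand-written for this example.
RELATED = {("tall", "high"), ("tall", "height")}


def related(p, q):
    """True if two predicate names are identical or linked in RELATED."""
    return p == q or (p, q) in RELATED or (q, p) in RELATED


def match(question, answer):
    """Match a question triple against an answer triple.

    Each triple is (predicate, entity, value). The question uses None for
    the unknown value (the ?Y of the slide). Returns the answer's value,
    or None when the two triples do not match.
    """
    q_pred, q_entity, q_value = question
    a_pred, a_entity, a_value = answer
    if q_value is not None:
        return None  # the question has no unknown to solve for
    if q_entity == a_entity and related(q_pred, a_pred):
        return a_value
    return None


question = ("tall", "Mt_Everest", None)        # tall(X, ?Y) & X = Mt_Everest
answer = ("high", "Mt_Everest", "8800m")       # high(X, Y) & Y = 8800m
unrelated = ("climb", "Mt_Everest", "Hillary")  # same entity, unrelated predicate

print(match(question, answer))     # 8800m
print(match(question, unrelated))  # None
```

    The first candidate matches because "tall" and "high" are linked in the table; the second shares the entity but not a related predicate, so it is rejected even though it contains the same keywords.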


    Matching: Lexical Relations

    Lexical relations are semantic relations between words:

    Synonym: (human, person)

    Antonym: (tall, short)

    Hyponym: (BMW, car)

    Meronym: (door, house)

    Entailment: (fire, smoke)

    ... plus many more

    Matching algorithm computes the semantic distance between the Question and the Answer.


    Matching and reasoning

    reasoning crucial to Q-A

    WordNet provides:

    hypernym, synonym, antonym, meronym, etc.

    (but) common relations required for Q-A tasks missing:

    noun-adjective (benefit-beneficial)

    verb-noun (punish-punishment)

    entailment (penalty-punishment)

    telic (hammer-break)

    limited current research on learning of semantic relations


    Search engines

    Current state of the art

    Google uses backlinks to determine the most relevant pages

    Most of the other search engines use keyword scanning techniques

    Online directories, such as DMOZ, use human editors to sort and rank content


    Beyond Search Engines

    Essential ingredients

    A better syntactic understanding of document contents

    More efficient means of grouping similar or related elements together

    A better understanding of relevance to the user

    A better query interface?


    The ideal situation

    Error free disambiguation of natural language in documents

    Categorization of documents by subject and intent of the author, rather than by scanned keywords

    The opportunity to clarify the information needs of individual users, to more closely match what they want


    Possible dimensions for queries

    Who is the First Lord of the Treasury of the United Kingdom?

    First Lords of the Treasury

    Prime Ministers

    Head of State in the UK


    Are Document Dimensions an answer?

    In data warehousing terminology, a dimension is a structure that categorizes data in order to enable end users to answer questions

    Charles Bachman urged programmers to think in terms of multi-dimensional space as far back as 1973

    Moth experimented with document metadata in dimensional space (2001 and 2003)

    Roelleke used the accessibility dimension to determine relevance within a document


    How are dimensions created?

    Cleanse and tokenize source data

    Shallow parse the source data to resolve some syntactic ambiguity

    Extract a series of unique terms, words or phrases

    Determine the similarity between individual terms

    Organize similar terms into dimensions, groups of semantically related elements
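
    The cleansing, tokenizing and term extraction steps listed above could look roughly like the sketch below. It is a deliberately crude stand-in: cleansing is reduced to lowercasing and keeping alphanumeric runs, there is no shallow parsing, and the function names tokenize and unique_terms are hypothetical, not taken from the tutorial.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits (a crude cleanse)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unique_terms(texts):
    """Count how often each distinct term occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


docs = [
    "Mt Everest is part of the Himalayan range.",
    "Susan Armstrong climbed the 8800m high Mt Everest.",
]
terms = unique_terms(docs)

print(sorted(terms))     # the distinct terms, alphabetically
print(terms["everest"])  # 2
```

    The resulting set of distinct terms is the input to the similarity and grouping steps; a real pipeline would also need phrase detection to extract multi-word terms.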


    Analysing source documents

    The source data was the sample newswire articles from TREC-11, 3 gigabytes of XML formatted data, consisting of around 20,000 articles

    Stripping XML formatting

    Detecting sentence boundaries in articles

    POS tagging individual sentences

    Named Entity and Coreference annotation
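
    The first two steps of this pipeline (stripping XML and detecting sentence boundaries) can be sketched with the Python standard library alone. The element names DOC and TEXT and the sample article are assumptions made for illustration, not the actual TREC-11 markup, and the sentence splitter is a naive regular expression that would mis-split abbreviations such as "Mt." in real newswire text. POS tagging and named entity annotation need trained models and are not shown.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample article; the real corpus markup may differ.
SAMPLE = (
    "<DOC><TEXT>Kenneth Joseph Lenihan died May 25 at his home in Manhattan. "
    "He was a research sociologist.</TEXT></DOC>"
)


def strip_xml(xml_string, body_tag="TEXT"):
    """Return the plain text inside every <body_tag> element, whitespace-normalized."""
    root = ET.fromstring(xml_string)
    parts = ["".join(el.itertext()) for el in root.iter(body_tag)]
    return " ".join(" ".join(parts).split())


def split_sentences(text):
    """Naive splitter: break after . ! or ? when followed by space and a capital."""
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s]


sentences = split_sentences(strip_xml(SAMPLE))
for sentence in sentences:
    print(sentence)
```

    Each sentence produced this way would then be handed to the tagger and the annotators in turn.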


    NE annotation

    Kenneth Joseph Lenihan, a New York research sociologist who helped refine the scientific methods used in criminology, died May 25 at his home in Manhattan.

    Named Entity (Person name): Kenneth Joseph Lenihan

    Named Entity (Location name): New York

    Named Entity (Temporal entity): May 25

    Named Entity (Location name): Manhattan


    Semantic distance

    Semantic distance uses the concept of relatedness, or the semantic similarity between two lexical concepts

    Grouping synonyms together seems intuitive, e.g. humans, people, beings

    But surprisingly, other lexical concepts such as meronyms, hyponyms, hypernyms, troponyms and even antonyms can also be semantically close.

    Different semantic distance algorithms for WordNet quantify relatedness in different ways (Budanitsky 2001)


    Semantic distance continued

    Dimensions are found by setting an inclusion distance, an experimentally derived figure for semantic distance

    The inclusion distance differs between algorithms, and can sometimes even differ depending on the dataset

    All terms which are within the specified inclusion distance are grouped in the same dimension

    Terms within a dimension serve as a starting point for searching related concepts
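
    One possible reading of the grouping rule above is single-link clustering: two terms share a dimension whenever a chain of pairs, each within the inclusion distance, connects them. The sketch below implements that reading. The DISTANCE table holds made-up values (0 for identical, 1 for unrelated), not the output of any WordNet measure, and the names distance and build_dimensions are hypothetical. The tutorial does not say which grouping algorithm was actually used.

```python
from itertools import combinations

# Hypothetical pairwise distances; any pair not listed counts as unrelated (1.0).
DISTANCE = {
    frozenset(("human", "person")): 0.1,
    frozenset(("person", "people")): 0.2,
    frozenset(("tall", "short")): 0.3,
}


def distance(a, b):
    """Look up the toy semantic distance between two terms."""
    return 0.0 if a == b else DISTANCE.get(frozenset((a, b)), 1.0)


def build_dimensions(terms, inclusion_distance):
    """Group terms by single-link clustering using a union-find structure."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the group representative, compressing the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if distance(a, b) <= inclusion_distance:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for t in terms:
        groups.setdefault(find(t), set()).add(t)
    return sorted(sorted(g) for g in groups.values())


terms = ["human", "person", "people", "tall", "short", "duck"]
print(build_dimensions(terms, 0.25))
print(build_dimensions(terms, 0.35))
```

    With the smaller threshold, "tall" and "short" stay in separate dimensions; raising it to 0.35 merges them, which shows how sensitive the resulting dimensions are to the experimentally chosen inclusion distance.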


    Issues with dimensions

    Search space explosion (at least thrice the number of documents are returned)

    Stored semantic knowledge is not sufficiently granular:

    "duck" has 85 different entries in Roget's thesaurus; 47 verb definitions, 21 noun definitions and 18 uses as an adjective.

    However, the term database stores only the part of speech tag. Thus, all 47 uses of "duck" as a verb are clumped together

    WordNet is not sufficiently rich in lexical relations, nor sufficiently inclusive of modern language idioms


    The way forward

    Adding natural language processing techniques to search is the answer

    Processing capabilities allow NLP techniques to be included without significant degradation of speed

    Richer lexicons and language resources are being developed

    People are continually asking harder questions of available information resources; keyword searches no longer satisfy end users!

    IBM's WebFountain and the MOMINIS research project are two of several research initiatives to bring focused crawling and natural language processing techniques to search
