
Page 1: WATSON  @  RPI

WATSON @ RPI

PROFESSOR JIM HENDLER
SIMON ELLIS
KATE MCGUIRE, NICOLE NEGEDLY, AVI WEINSTOCK, MATT KLAWONN, JENN CHAN, SARABETH JAFFE

WATSON TECHNOLOGIES AND OPEN ARCHITECTURE QUESTION ANSWERING

INSIDE DEEPQA
Managing complex unstructured data with UIMA

Simon Ellis

22nd November, 2013

Page 2: WATSON  @  RPI

INTRODUCTION

Page 3: WATSON  @  RPI

IBM Watson

Page 4: WATSON  @  RPI

Watson is…
- … a piece of software that will run on your laptop
  - Though very slowly
  - Specialised hardware and control platform
- … an implementation of the DeepQA concept
- … the first iteration of the ‘cognitive computing’ platform
- … a very clever artificial intelligence
  - A very clever application of human intelligence

Page 5: WATSON  @  RPI

Inside Watson

Watson pipeline as published by IBM; see IBM J Res & Dev 56 (3/4), May/July 2012, p. 15:2

Page 6: WATSON  @  RPI

Nicole Negedly

QUESTION ANALYSIS

Page 7: WATSON  @  RPI

Question Analysis

Page 8: WATSON  @  RPI

Question analysis
- What is the question asking for?
- Which terms in the question refer to the answer?
- Given any natural language question, how can Watson accurately discover this information?

Example question: Who is the president of Rensselaer Polytechnic Institute?

Question Analysis output:
- Focus Terms: “Who”, “president of Rensselaer Polytechnic Institute”
- Answer Types: Person, President

Page 9: WATSON  @  RPI

Parsing and semantic analysis

What information about a previously unseen piece of English text can Watson determine?

How is this information useful?

Natural Language Parsing
- grammatical structure
- parts of speech
- relationships between words
- ...etc.

Semantic Analysis
- meanings of words, phrases, etc.
- synonyms, entailment
- hypernyms, hyponyms
- ...etc.

Page 10: WATSON  @  RPI

Parsing

Stanford’s NLP toolset is used

Page 11: WATSON  @  RPI

Semantic relations in WordNet

Princeton University’s WordNet
- Words are grouped into sets of synonyms called synsets
- Relationships exist between noun synsets
  - hypernym/hyponym: type-of relation, e.g. “canine” is a hypernym of “dog”
  - holonym/meronym: part-of relation, e.g. “building” is a holonym of “window”
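The type-of relation can be pictured as a chain of links that is walked upward from a specific word to more general ones. The following is a minimal sketch over a hand-written toy map, not the real WordNet database or any WordNet library; the class name `ToyWordNet`, its methods, and every entry other than the dog/canine pair from the slide are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for WordNet's noun hierarchy (hypothetical data, not the real database).
class ToyWordNet {
    // Maps a word to its direct hypernym, e.g. "dog" is a type of "canine".
    private static final Map<String, String> HYPERNYM = new HashMap<>();
    static {
        HYPERNYM.put("dog", "canine");
        HYPERNYM.put("canine", "carnivore");
        HYPERNYM.put("carnivore", "mammal");
        HYPERNYM.put("mammal", "animal");
    }

    // Walks upward through the type-of relation, collecting every ancestor in order.
    static List<String> hypernymChain(String word) {
        List<String> chain = new ArrayList<>();
        String current = HYPERNYM.get(word);
        while (current != null) {
            chain.add(current);
            current = HYPERNYM.get(current);
        }
        return chain;
    }

    // True if 'general' can be reached from 'specific' by following hypernym links.
    static boolean isA(String specific, String general) {
        return hypernymChain(specific).contains(general);
    }

    public static void main(String[] args) {
        System.out.println(hypernymChain("dog"));
        System.out.println(isA("dog", "mammal"));
        System.out.println(isA("mammal", "dog"));
    }
}
```

A check of this kind is one plausible way to test whether a candidate answer is compatible with an expected answer type.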

Page 12: WATSON  @  RPI

How is this useful?
- This information can be used to “understand” a question
- Current Question Analysis work with RPI’s version of Watson
  - Creating and training machine learning classifiers

[Diagram labels: Parse Trees, Dependency Relations, Coreferences, Named Entities, Semantic Relations; Manually Annotated Questions; Classifiers; New Question; Critical Elements of Question]

Page 13: WATSON  @  RPI

Question analysis pipeline

Unstructured Question Text → Parsing & Semantic Analysis → Machine Learning Classifiers → Structured Annotations of Question: focus, answer types, useful search queries

Page 14: WATSON  @  RPI

Kate McGuire

CANDIDATE GENERATION

Page 15: WATSON  @  RPI

Search Result Processing and Candidate Generation

Page 16: WATSON  @  RPI

Primary Search
- Primary Search generates the corpus of information from which the system takes candidate answers, passages, supporting evidence, and essentially all of its textual input.
- It formulates queries based on the results of Question Analysis.
- These queries are passed to a search engine, which returns a set number of highly relevant documents and their ranks.

Page 17: WATSON  @  RPI

Search Result Processing
- Search Result Processing restructures the information in the document so that it is useful.
  - HTML tags are cleaned from the document
- Passage Retrieval/Chunking
  - Breaks the document down into smaller pieces
  - Adds information, such as the HTML text, length, place in the document, etc.
- Passage Parsing
  - Parse trees are formed for each passage

Page 18: WATSON  @  RPI

Candidate Generation
- Candidate Generation casts a wide net of possible answers for the question from each document.
- Using each document, and the passages created by Search Result Processing, we generate candidates using three techniques:
  - Title of Document (T.O.D.): Adds the title of the document as a candidate.
  - Wikipedia Title Candidate Generation: Adds any noun phrases within the document’s passage texts that are also the titles of Wikipedia articles.
  - Anchor Text Candidate Generation: Adds candidates based on the hyperlinks and metadata within the document.
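The three techniques can be sketched as three small rules that all feed one candidate set. This is only an illustration under stated assumptions: `ToyDocument`, `ToyCandidateGenerator`, the pre-extracted noun phrases and the in-memory set of Wikipedia titles are all hypothetical, and anchor-text generation is reduced here to taking link text as-is, without the metadata the slide mentions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy stand-in for one search result after Search Result Processing (hypothetical).
class ToyDocument {
    final String title;
    final List<String> nounPhrases;  // noun phrases already extracted from the passage texts
    final List<String> anchorTexts;  // link text taken from the document's hyperlinks

    ToyDocument(String title, List<String> nounPhrases, List<String> anchorTexts) {
        this.title = title;
        this.nounPhrases = nounPhrases;
        this.anchorTexts = anchorTexts;
    }
}

class ToyCandidateGenerator {
    // Applies the three rules in turn; a LinkedHashSet removes duplicates but keeps order.
    static Set<String> generate(ToyDocument doc, Set<String> wikipediaTitles) {
        Set<String> candidates = new LinkedHashSet<>();
        // 1. Title of Document: the title itself is a candidate.
        candidates.add(doc.title);
        // 2. Wikipedia Title: keep noun phrases that are also article titles.
        for (String phrase : doc.nounPhrases) {
            if (wikipediaTitles.contains(phrase)) {
                candidates.add(phrase);
            }
        }
        // 3. Anchor Text: every piece of link text is a candidate (simplified).
        candidates.addAll(doc.anchorTexts);
        return candidates;
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument(
                "Rensselaer Polytechnic Institute",
                Arrays.asList("Shirley Ann Jackson", "the current president", "Troy"),
                Arrays.asList("Bell Labs"));
        Set<String> titles = new HashSet<>(Arrays.asList("Shirley Ann Jackson", "Troy"));
        System.out.println(generate(doc, titles));
    }
}
```

The generator deliberately over-produces: weeding out poor candidates is left to the scoring stage.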

Page 19: WATSON  @  RPI

Search Result Processing and Candidate Generation

Page 20: WATSON  @  RPI

Matt Klawonn

SCORING & RANKING

Page 21: WATSON  @  RPI

Scoring & Ranking

Page 22: WATSON  @  RPI

Scoring
- Analyzes how well a candidate answer relates to the question
- Two basic types of scoring algorithm
  - Context-independent scoring
  - Context-dependent scoring

Page 23: WATSON  @  RPI

Types of scorers
- Context-independent
  - Question Analysis
  - Ontologies (DBpedia, YAGO, etc.)
  - Reasoning
- Context-dependent
  - Analyzes natural language that candidates appear in
  - Relies on “passages” found during search

Page 24: WATSON  @  RPI

Scorers
- Examples of scorers include
  - Passage Term Match
  - Textual Alignment
  - Skip-Bigram
- Each of these scores supportive evidence
- Scores are then merged to produce a single candidate score
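To make the idea of several scorers feeding one merged score concrete, here is a rough sketch of two context-dependent scorers and a merge step. It is not Watson's implementation: both scorers are simplified to plain token overlap, the skip-bigram version works on the token sequence only, and the merge is a weighted average with made-up weights. All names (`ToyScorers`, `passageTermMatch`, `skipBigramScore`, `merge`) are assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ToyScorers {
    // Lower-cases the text and splits it on non-word characters, dropping empty tokens.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Fraction of the items in 'wanted' that also occur in 'available'.
    private static double overlap(Set<String> wanted, Set<String> available) {
        if (wanted.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String w : wanted) {
            if (available.contains(w)) {
                hits++;
            }
        }
        return (double) hits / wanted.size();
    }

    // Simplified passage term match: share of distinct question terms found in the passage.
    static double passageTermMatch(String question, String passage) {
        return overlap(new HashSet<>(tokens(question)), new HashSet<>(tokens(passage)));
    }

    // Ordered token pairs with at most maxSkip tokens between them.
    static Set<String> skipBigrams(List<String> toks, int maxSkip) {
        Set<String> pairs = new HashSet<>();
        for (int i = 0; i < toks.size(); i++) {
            for (int j = i + 1; j < toks.size() && j - i <= maxSkip + 1; j++) {
                pairs.add(toks.get(i) + " " + toks.get(j));
            }
        }
        return pairs;
    }

    // Simplified skip-bigram score: share of question skip-bigrams found in the passage.
    static double skipBigramScore(String question, String passage) {
        return overlap(skipBigrams(tokens(question), 1), skipBigrams(tokens(passage), 1));
    }

    // Weighted average of the individual scores (the weights are illustrative only).
    static double merge(double[] scores, double[] weights) {
        double total = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            total += scores[i] * weights[i];
            weightSum += weights[i];
        }
        return weightSum == 0.0 ? 0.0 : total / weightSum;
    }

    public static void main(String[] args) {
        String q = "president of RPI";
        String p = "The president of RPI is Shirley Ann Jackson";
        double[] scores = { passageTermMatch(q, p), skipBigramScore(q, p) };
        System.out.println(merge(scores, new double[] { 1.0, 1.0 }));
    }
}
```

Term overlap ignores word order, while skip-bigrams reward passages that keep the question's terms in the same relative order, which is why the two are scored separately before merging.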

Page 25: WATSON  @  RPI

Inside Watson

Watson pipeline as published by IBM; see IBM J Res & Dev 56 (3/4), May/July 2012, p. 15:2

Page 26: WATSON  @  RPI

Simon Ellis

THE TAO OF UIMA

Page 27: WATSON  @  RPI

UIMA

‘Unstructured Information Management Architecture’

A platform for the analysis of unstructured information and its integration with search technologies

Permits multi-modal analysis of collections or archives

Page 28: WATSON  @  RPI

UIMA

http://uima.apache.org/d/uimaj-2.4.0/

Page 29: WATSON  @  RPI

‘Unstructured information’
- The most rapidly-growing source of information in existence
  - The internet
  - Print media
  - Video recordings
  - Audio recordings
  - ...
- “Unstructured information is just information that doesn’t have the kind of structure you need it to have for what you’re doing.” [Peter Fox, X-Informatics class]

Page 30: WATSON  @  RPI

UIMA (again)

The UIMA platform can be thought of in four ways:

A specification for component interfaces for, and in, an analytics pipeline

A specification of certain design patterns for that pipeline

An outline of 2 data representations: in-memory annotations for local analysis and XML representation for remote web integration

An outline for possible development roles allowing tools to be used by users with a wide range of skills

Page 31: WATSON  @  RPI

CAS
- Common Analysis Structure (CAS)
  - Object-based structure
  - Allows representation of objects, properties and values
  - Stores arbitrary data structures
    - Annotations
    - Types
  - Object types may be related by single-inheritance
  - Contains document being analysed, either physically or logically
- Results of analysis are shared and recorded in a CAS

Page 32: WATSON  @  RPI

Annotator
- Core UIMA component type
- Contains analysis algorithms designed to work on data contained in a CAS
  - Original document
  - Annotation
  - Search evidence
  - Candidate score
  - ...
- Annotators form the building blocks of Analysis Engines

Page 33: WATSON  @  RPI

Analysis Engine
- Building blocks of a UIMA pipeline
- Section of code containing 1 or more annotators
- Analyses source document(s) and provides analysis results
  - Results typically represent metadata about the source
- Analysis Engines are effectively software agents that discover and record metadata
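The relationship between a CAS, annotators and an analysis engine can be shown with a toy model. The classes below are plain Java written for this illustration and are not the Apache UIMA classes; `ToyCas`, `ToyAnnotation`, `ToyAnnotator`, `TokenAnnotator`, `CapitalisedAnnotator` and `ToyAnalysisEngine` are all hypothetical names. The sketch shows the one idea that matters: every annotator reads from and writes to the same shared structure, so a later annotator can build on an earlier one's results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: these are NOT the Apache UIMA classes.
class ToyAnnotation {
    final String type;
    final int begin;
    final int end;

    ToyAnnotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

// Holds the document plus every analysis result recorded so far.
class ToyCas {
    final String documentText;
    final List<ToyAnnotation> annotations = new ArrayList<>();

    ToyCas(String documentText) {
        this.documentText = documentText;
    }

    // Counts the annotations of one type.
    int count(String type) {
        int n = 0;
        for (ToyAnnotation a : annotations) {
            if (a.type.equals(type)) {
                n++;
            }
        }
        return n;
    }
}

interface ToyAnnotator {
    void process(ToyCas cas);
}

// Records a "Token" annotation for each whitespace-separated word.
class TokenAnnotator implements ToyAnnotator {
    public void process(ToyCas cas) {
        String text = cas.documentText;
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                cas.annotations.add(new ToyAnnotation("Token", start, i));
            }
        }
    }
}

// Depends on upstream Token annotations: marks tokens that start with a capital letter.
class CapitalisedAnnotator implements ToyAnnotator {
    public void process(ToyCas cas) {
        List<ToyAnnotation> found = new ArrayList<>();
        for (ToyAnnotation a : cas.annotations) {
            if (a.type.equals("Token") && Character.isUpperCase(cas.documentText.charAt(a.begin))) {
                found.add(new ToyAnnotation("Capitalised", a.begin, a.end));
            }
        }
        cas.annotations.addAll(found);
    }
}

// Runs its annotators in order over one shared CAS.
class ToyAnalysisEngine {
    private final List<ToyAnnotator> annotators;

    ToyAnalysisEngine(ToyAnnotator... annotators) {
        this.annotators = Arrays.asList(annotators);
    }

    void run(ToyCas cas) {
        for (ToyAnnotator annotator : annotators) {
            annotator.process(cas);
        }
    }

    public static void main(String[] args) {
        ToyCas cas = new ToyCas("Shirley Ann Jackson leads RPI");
        new ToyAnalysisEngine(new TokenAnnotator(), new CapitalisedAnnotator()).run(cas);
        System.out.println(cas.count("Token") + " tokens, " + cas.count("Capitalised") + " capitalised");
    }
}
```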

Page 34: WATSON  @  RPI

Example

http://uima.apache.org/d/uimaj-2.4.0/

Page 35: WATSON  @  RPI

Sofas and CAS Views
- Sofa
  - Subject of Analysis
  - A piece of data intended for analysis by UIMA components
- CAS View
  - A section of a CAS dedicated to one Sofa
  - Shares the same name as its Sofa
  - May be dynamically created as needed by applications or AEs
- Each Sofa permits a different perspective of an artefact

Page 36: WATSON  @  RPI

Example

Dr Shirley Ann Jackson

Teacher of physics

President, RPI

Researcher at Bell Labs

IBM Board of Directors

Chairman, USNRC

Page 37: WATSON  @  RPI

Descriptors
- All components consist of two parts
  - Code
  - Descriptor (declaration)
- Functions of the descriptor
  - Contains metadata about the code block
    - Name
    - Structure
    - Behaviour
  - Used in component discovery, reuse, and tool composition

Page 38: WATSON  @  RPI

UIMA (again, again)
- Highly reliant on XML
  - Flexible
  - Extensible
- XML...
  - ... describes components and their behaviour
  - ... controls data (CAS) flow through the pipeline
  - ... is used to create larger components from subcomponents
    - Aggregate Analysis Engines

Page 39: WATSON  @  RPI

Aggregate Analysis Engine
- A complex analysis engine made up of other components
  - May contain simple AEs or other AAEs
  - Components further down the pipeline may rely on all earlier output
- Performs a larger, complete task, e.g. named entity recognition
  - language detection and tokenisation
  - part-of-speech detection
  - deep grammatical parsing
  - named entity recognition

Page 40: WATSON  @  RPI

CAS Multiplier
- Creates 0 or more new CAS objects from an input CAS
- May be used to duplicate or merge CAS objects, e.g....
  - ... creating alternative versions of an input Sofa
  - ... breaking a large input CAS into multiple smaller pieces
  - ... aggregating multiple input CAS into a single output
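The "breaking a large input into smaller pieces" case can be sketched as follows. This is a toy model in which a "CAS" is just a string of document text. The `process()` / `hasNext()` / `next()` shape is intended to loosely echo how a multiplier hands back its outputs one at a time, but the class `ToySplitter` and its paragraph-splitting rule are assumptions, not UIMA code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model only: a "CAS" here is simply a string of document text.
class ToySplitter {
    private final Deque<String> pending = new ArrayDeque<>();

    // Accepts one input and queues zero or more output pieces (one per paragraph).
    void process(String inputCas) {
        for (String piece : inputCas.split("\\n\\s*\\n")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                pending.add(trimmed);
            }
        }
    }

    // True while there are output pieces that have not been handed out yet.
    boolean hasNext() {
        return !pending.isEmpty();
    }

    // Hands out the next output piece.
    String next() {
        return pending.remove();
    }

    // Convenience helper: runs one input through a fresh splitter and drains it.
    static List<String> splitAll(String inputCas) {
        ToySplitter splitter = new ToySplitter();
        splitter.process(inputCas);
        List<String> out = new ArrayList<>();
        while (splitter.hasNext()) {
            out.add(splitter.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splitAll("First paragraph.\n\nSecond paragraph.\n\n\nThird."));
    }
}
```

An input containing only whitespace yields no pieces at all, which is the "0 or more" case from the slide.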

Page 41: WATSON  @  RPI

Inside Watson

Watson pipeline as published by IBM; see IBM J Res & Dev 56 (3/4), May/July 2012, p. 15:2

Page 42: WATSON  @  RPI

UIMA, once more
- UIMA runs in the Java Runtime Environment
- Uses XML code to run the system
  - UIMA framework reads XML dynamically and creates objects from it
  - Only the UIMA framework itself is compiled

SO HOW DOES IT WORK?

Page 43: WATSON  @  RPI

How it works
- Abstract class prototyping
  - UIMA Framework objects are usually derived from a base class
- Function signature
  - UIMA Framework objects each have certain functions which can or must be overridden
    - initialize()
    - process()
- This ensures all classes are of known supertypes and have a recognisable function signature for all key functions
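The pattern described above can be illustrated with a small sketch: a base class fixes the signatures, subclasses override what they need, and the framework only ever talks to the supertype. The names `ToyComponent`, `UppercaseComponent`, `PrefixComponent` and `ToyFramework`, and the string-to-string signature of `process()`, are assumptions chosen for brevity and do not reproduce UIMA's actual base classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical base class: fixes the signatures every component must share.
abstract class ToyComponent {
    protected Map<String, String> config = new HashMap<>();

    // May be overridden; the default just keeps a copy of the configuration.
    void initialize(Map<String, String> config) {
        this.config = new HashMap<>(config);
    }

    // Must be overridden; this is where the component does its work.
    abstract String process(String input);
}

// Overrides only process().
class UppercaseComponent extends ToyComponent {
    @Override
    String process(String input) {
        return input.toUpperCase();
    }
}

// Overrides both initialize() and process().
class PrefixComponent extends ToyComponent {
    private String prefix = "";

    @Override
    void initialize(Map<String, String> config) {
        super.initialize(config);
        prefix = this.config.getOrDefault("prefix", "");
    }

    @Override
    String process(String input) {
        return prefix + input;
    }
}

class ToyFramework {
    // The framework knows only the supertype, never the concrete class.
    static String run(ToyComponent component, Map<String, String> config, String input) {
        component.initialize(config);
        return component.process(input);
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("prefix", ">> ");
        System.out.println(run(new UppercaseComponent(), config, "watson"));
        System.out.println(run(new PrefixComponent(), config, "watson"));
    }
}
```

Because `run()` depends only on `ToyComponent`, new components can be added without changing the framework code.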

Page 44: WATSON  @  RPI

How it works
- Reflection
  - The ability of a computer program to examine and modify the structure and behavior (specifically the values, meta-data, properties and functions) of an object at runtime.
- XML descriptors define the nature of objects
  - class name
  - constructor parameters
  - ...
- UIMA dynamically creates objects using reflection

Page 45: WATSON  @  RPI

The ‘magic code’

import java.lang.reflect.Constructor;

// declare a variable of the type we want
JCasAnnotator ann = null;

// look up the Class object for the named class at runtime
Class<?> annClass = Class.forName("com.ibm.tutorial.tycor");

// get the constructor that matches the given parameter types
Constructor<?> cons = annClass.getConstructor(<params>);

// newInstance() returns Object, so cast it to the expected JCasAnnotator type
ann = (JCasAnnotator) cons.newInstance(<params>);
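For readers who want to run the idea end to end, here is a self-contained variant of the snippet above that reads a class name and a constructor parameter from a tiny XML string and then instantiates the class by reflection. The XML format (`<annotator class="..." param="..."/>`) is invented for this sketch and is not UIMA's descriptor schema; `ToyAnnotatorBase`, `ShoutingAnnotator` and `ToyDescriptorLoader` are likewise hypothetical. Only standard JDK APIs are used.

```java
import java.io.ByteArrayInputStream;
import java.lang.reflect.Constructor;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Hypothetical supertype standing in for an annotator base class.
abstract class ToyAnnotatorBase {
    abstract String process(String text);
}

// Hypothetical concrete component; the loader never refers to it by name.
class ShoutingAnnotator extends ToyAnnotatorBase {
    private final String suffix;

    public ShoutingAnnotator(String suffix) {
        this.suffix = suffix;
    }

    @Override
    String process(String text) {
        return text.toUpperCase() + suffix;
    }
}

class ToyDescriptorLoader {
    // Reads the class name and one String parameter from the XML, then builds the object.
    static ToyAnnotatorBase load(String descriptorXml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(descriptorXml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            String className = root.getAttribute("class");
            String param = root.getAttribute("param");

            // Look up the class by name at runtime.
            Class<?> cls = Class.forName(className);
            // Find the public constructor that takes a single String.
            Constructor<?> cons = cls.getConstructor(String.class);
            // newInstance() returns Object, so cast to the known supertype.
            return (ToyAnnotatorBase) cons.newInstance(param);
        } catch (Exception e) {
            throw new IllegalStateException("Could not create component from descriptor", e);
        }
    }

    public static void main(String[] args) {
        // In a real system the XML would be read from a file; it is built inline here.
        String xml = "<annotator class=\"" + ShoutingAnnotator.class.getName() + "\" param=\"!\"/>";
        System.out.println(load(xml).process("watson"));
    }
}
```

The cast to the supertype is what ties this back to abstract class prototyping: the loader can call `process()` without knowing which concrete class the descriptor named.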

Page 46: WATSON  @  RPI

UIMA, finally
- Effectively an interpreter for code ‘scripted’ in XML and Java
- Component-oriented design makes scaling easy
  - BlueJ (Jeopardy! hardware) had ≫ 2,000 cores
- Most easily written in Java
  - Java runs in the Java Runtime Environment
  - Dynamic class loading & reflection are therefore possible
  - Could not have been written in C++08
- An OS for multimodal, unstructured information management

Page 47: WATSON  @  RPI

QUESTIONS & ANSWERS