
Iowa State University Department of Computer Science

Copyright © Sushain Pandit, 2010.

Sushain Pandit

Department of Computer Science

Iowa State University

Ames, Iowa, USA

M.S. Thesis Defense

April 29, 2010

Ontology-guided Extraction of Structured Information from

Unstructured Text:

Identifying and Capturing Complex Relationships

1/63


Introduction

Motivation

Contributions

Problem

Related Concepts

Definition

Approach

Composite Extraction Framework

Semantic Validation Framework

Representation Framework

Evaluation

SEMANTIXS Architecture

Experimental Results and Analysis

Conclusion

Summary

Further Work


Outline

2/63


Introduction

Motivation

Contributions


Outline

3/63



Introduction Motivation

Information Extraction from Text

Process of extracting interesting information from unstructured text

Entities – Persons, Organizations, Locations, etc

Attributes – Name, Descriptors, Categories, etc

Events – Company established in 2010

Relationships – Person works for Organization

Co-references – IBM and International Business Machines

….

Our Focus – A subset of complex nested relationships

4/63




Motivating Example - A Moderately Complex Sentence

Sports News predicts that Sachin Tendulkar may score a double-hundred with high probability

and retire in 2015

5/63




Motivating Example – Usual Information Extraction Scene

Sports News predicts that Sachin Tendulkar may score a double-hundred with high probability

and retire in 2015

[Figure: the sentence above annotated with Entity / Entities, Event, Relationship and Attribute labels]

6/63




Motivating Example – Nested Relationships

Sports News predicts that – Outer Relationship Dependency; Clause-level Dependency

Sachin Tendulkar may score a double-hundred – Inner Clause subject to the Qualifying Modifier

with high probability – A Qualifying Modifier

and retire in 2015 – Conjunction creating dependencies between parts of the sentence; left part governing the meaning of right part

7/63




Motivating Example – Domain Description

Sports News predicts that Sachin Tendulkar may score a double-hundred with high probability and retire in 2015

Sachin Tendulkar – Internationally recognized Sportsperson, or someone else by the same name?

8/63




Motivating Example – Domain Description

Sports News predicts that Sachin Tendulkar may score a double-hundred with high probability and retire in 2015

Domain Description in the form of a domain ontology

Sachin_Tendulkar type SportsPerson

Sachin Tendulkar – Internationally recognized Sportsperson

9/63




Motivating Example – Representation

[Figure: semantic graph in which Sports News "predicts that" two statements hold: Sachin_Tendulkar scores double-hundred (qualified by high probability), and Sachin_Tendulkar retire 2015]

Some Semantic Graph Formalism that can Capture the Structured Information

10/63




Existing Approaches for Information Extraction

Rule-based Approaches

Laborious but transparent in capturing complex semantic criteria

Best performing systems invariably use hand-crafted rules

Often rely on domain-specific trigger words

Automatic pattern induction (statistical methods)

Co-occurrence – Requires a lot of labeled text corpora

Cluster Analysis – Incurs computational cost for feature preparation

Comprehensive surveys

N. Bach and S. Badaskar, 2007; G. Neumann and F. Xu, 2004

Recall: Our Focus – A subset of complex nested relationships

Our Approach – Domain independent rule formulation

11/63



Introduction Contributions

Contributions

A modular ontology-based approach for extraction of a subset of nested-

complex relationships that decouples domain-specific knowledge from

the rules used for information extraction

A framework to semantically represent the extracted relationships in the

form of query-able RDF graphs

Provide open-source implementation of SEMANTIXS, a system for

ontology-guided extraction of structured information from text

Report results of some experiments to validate the proposed approach

12/63


Problem

Related Concepts

Definition


Outline

13/63



Problem Related Concepts

Parse Trees

Ordered and rooted trees representing the syntactic structure

Penn Treebank notation [1] for tagging the sentence

S: Simple declarative clause

NP: Categorizes all constituents depending on a head noun.

VP: Categorizes all constituents headed by a verb.

[1] See ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/notation.tex for a complete list

Example

S

NP VP

Heart Attack Causes reduced

average lifespan

NP

14/63




Dependency Graphs

Structures capturing implicit dependencies (sentential semantics)

between the tokens of a sentence

Stanford dependency notation [2] for labeling the graph

[2] See the handout. For a complete list, see http://nlp.stanford.edu/software/dependencies_manual.pdf

Example

15/63




Formal Specifications for Validation and Representation

Domains generally described by concepts, relationships and instances

Need for a formalism to capture the domain description

Need for a suitable representation mechanism

An ontology is a structure O=(R,C) such that:

The sets R and C are disjoint and their elements are called relations

and concepts respectively

The elements in R induce a strict partial order on the elements in C

O = {{SportsPerson, Person, Number}, {scoredRuns}}

Domain(scoredRuns) = {SportsPerson}, Range(scoredRuns) = {Number}

Ontology

16/63




Domain Ontology with Instances

A domain ontology with instances is a structure DOI=(O,I,h) such that:

I is a set, whose elements are called instances

There exists a function $h: I \to P(C)$, where $P(C)$ is the power-set of the set of concepts for the ontology O

Example

O = {{SportsPerson, Person, Number}, {scoredRuns}}


I = {John, Steve, Sachin_Tendulkar}

h (Sachin_Tendulkar ) = {SportsPerson, Person}

h (John) = {Person}

17/63




Resource Description Framework (RDF)

Example – RDF Triple

Resources described using properties and values using RDF statements

Statements represented as RDF triples, consisting of a subject,

predicate and object

Uniform Resource Identifier (URI) for Resources

RDF Reification – Special mechanism to make assertions about

statements instead of entities

Sachin Tendulkar scored 200 runs

18/63



Problem Definition

Ontology-guided Structured Information Extraction

Given:

Text fragment consisting of sentences {Ti}

Domain ontology with instances DOI=(O,I,h)

Ontology-guided structured information extraction:

Determines a set TCTR of candidate information constructs

using entity and relationship extraction algorithm(s)

Validates TCTR with respect to DOI and finds a set K of

validated information constructs

Represents the triples in K using a suitable mechanism

Remainder of the presentation – Details of the above three steps

19/63


Approach



Outline

20/63



Recall: Motivating Example

[Figure: the motivating sentence with its annotated parts: the outer relationship "Sports News predicts that" with its clause-level dependency, the qualifying modifier "with high probability", and the conjunction "and retire in 2015" whose left part governs the meaning of the right part]

Approach Composite Extraction Framework

21/63




Terminology – Extraction Rule

Rule –

“Label(s) from {nn, amod} occur along the edges connected to an nsubj node” → “Group the associated nodes with the nsubj node”

“Labels nsubj & dobj occur along a set of adjacent edges” → “Extract the nodes associated with those edges as information constructs”

Result – Extraction of {{Heart, attack, causes}, reduced, {average, lifespan}} as candidate information construct
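The two rules above can be sketched in Python as follows. This is only an illustration, not the SEMANTIXS code. The edge list is written by hand to match the extraction result on this slide (with "reduced" as the governing verb). The names `EDGES`, `ORDER`, `group` and `extract_constructs` are hypothetical, and applying the grouping rule to the object node as well as the subject node is an assumption.

```python
# Typed dependencies as (governor, label, dependent).
# Hand-written for illustration; not actual parser output.
EDGES = [
    ("reduced", "nsubj", "causes"),
    ("causes", "nn", "Heart"),
    ("causes", "nn", "attack"),
    ("reduced", "dobj", "lifespan"),
    ("lifespan", "amod", "average"),
]
# Word order of the sentence, used only to print groups readably.
ORDER = ["Heart", "attack", "causes", "reduced", "average", "lifespan"]


def group(head, edges, labels=("nn", "amod")):
    """Rule 1: group a node with the nodes attached to it by nn/amod edges."""
    members = {head} | {d for g, l, d in edges if g == head and l in labels}
    return sorted(members, key=ORDER.index)


def extract_constructs(edges):
    """Rule 2: where nsubj and dobj edges share a node, emit a construct."""
    constructs = []
    for gov, label, subj in edges:
        if label != "nsubj":
            continue
        for gov2, label2, obj in edges:
            if gov2 == gov and label2 == "dobj":
                constructs.append((group(subj, edges), gov, group(obj, edges)))
    return constructs


print(extract_constructs(EDGES))
```

Keeping each rule as a separate function mirrors the premise/consequent form of the rules: the premise is the label test and the consequent is the grouping or extraction step.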

22/63



Motivating Example: Identifying Sub-problems

[Figure: the sub-problem "Sports News predicts that ... double-hundred", annotated with Clause-level Dependency, Outer Relationship Dependency and Qualifying Modifier]


23/63




Identifying Complex Relationship - Type 1

Relationships with Internal Clauses:

Variants:

That Macs are too cool for its customers, says Microsoft ad

Microsoft ad says: Macs are too cool for its customers

24/63




Rules for Entity & Relation Extraction – Type 1

Dependency Graph:

25/63





Expected Extraction Rule Behavior:

Clausal Complement – ccomp

Variants - parataxis

Leave this for later – Recursively

reduced to one of the other types

26/63





[Figure: the sub-problem "... double-hundred with high probability", annotated with A Qualifying Modifier]


27/63




Identifying Complex Relationship - Type 2

Relationships with Qualifiers:

Variants:

With high probability, Sachin Tendulkar may score a double-hundred.

There is a high probability that Sachin Tendulkar may score a double-hundred

28/63





Dependency Graph:

29/63





Expected Extraction Rule Behavior:

Prepositional Modifier – prep

Variants – prep_xxx

Adjectival Modifier – amod

Prep – amod Pattern Identifies

this Relationship Type

30/63





[Figure: the sub-problem "... double-hundred and retire in 2015", where the conjunction links the left part and the right part of the sentence]


31/63




Identifying Complex Relationship – Type 3

Relationships with Conjunctions:

32/63





Conjunctions connect parts having immediate dependencies

Reference resolution required between the parts

Utilize Sentence Parses instead of dependency graphs

Formulation:

If right-part contains Simple Declarative Clause (S), process as a

distinct sentence

If right-part contains Verb and Noun Phrases (VP, NP), use the subject

of left-part and process as a distinct sentence

If right-part contains only NP, use the subject and object of left-part

and process as a distinct sentence
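The three cases above can be sketched as a small decision function. This is a hedged illustration only. It assumes the right part of the conjunction is summarized by the set of Penn Treebank labels found in its parse. The names `borrow_for_right_part` and `as_sentence` are hypothetical, and the slide does not say how borrowed parts are ordered in the third case, so the function only reports what is borrowed.

```python
def borrow_for_right_part(right_labels, left_subject, left_object):
    """Decide what the right part must borrow from the left part.

    right_labels: set of constituent labels found in the right part.
    Returns the list of borrowed phrases (possibly empty).
    """
    if "S" in right_labels:
        # Case 1: a full declarative clause; nothing to borrow.
        return []
    if "VP" in right_labels:
        # Case 2: verb (and noun) phrases without a subject; borrow the subject.
        return [left_subject]
    if "NP" in right_labels:
        # Case 3: only a noun phrase; borrow subject and object of the left part.
        return [left_subject, left_object]
    return []


def as_sentence(right_text, borrowed):
    """Prefix the borrowed phrases so the fragment reads as a distinct sentence."""
    return " ".join(borrowed + [right_text])


borrowed = borrow_for_right_part({"VP", "PP", "NP"}, "Sachin Tendulkar", "a double-hundred")
print(as_sentence("retire in 2015", borrowed))
```

For the motivating example, the right part "retire in 2015" falls under the second case, so only the subject of the left part is prefixed before the fragment is processed as its own sentence.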

33/63





double-hundred


34/63




Rules for Entity & Relation Extraction – Simple

Dependency Graph

35/63




Extraction Algorithm - Illustration

Sports News predicts that

Sachin Tendulkar may score a double-hundred

with high probability

and retire in 2015

36/63





{Sports News, predicts}



and

retire in 2015

37/63




Extraction Algorithm – Illustration



38/63





retire in 2015

Append – Sachin Tendulkar

Sachin Tendulkar retire in 2015

39/63










{Sports News, predicts

{ Sachin Tendulkar, scored,

double hundred, probability, high } }

{ Sports News, predicts

{ Sachin Tendulkar, retire, 2015 } }

41/63


Approach


Semantic Validation Framework

Representation Framework


Outline

42/63



Approach Validation Framework

Validation using Domain Ontology

Extracted Information Constructs to be matched against the domain description

Instance matches for the subject and object

Relationship match for the predicate

Domain / Range check to ensure validity as per the domain

Validation Rule

Given

Set of sentences {Ti} with word-set W

Set TCTR of candidate constructs extracted by an extraction algorithm

Domain ontology with instances DOI=(O,I,h),

Mapping F from W to $R \cup I$

Validation process results in a set K of validated constructs such that:

43/63




Validation Rule (Contd.)

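$$\{\exists\, y_1, y_2 \in I,\ \exists\, r \in R \mid \{y_1, r, y_2\} \in K \iff (\exists\, w_1, w_2, w_3 \in W,\ c_1, c_2 \in C \mid \{w_1, w_3, w_2\} \in \mathit{TCTR} \land \{w_1, y_1\} \in F \land \{w_2, y_2\} \in F \land \{w_3, r\} \in F \land c_1 \in h(y_1) \land c_2 \in h(y_2) \land c_1 \in Domain(r) \land c_2 \in Range(r))\}$$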

Validation Rule - Illustrated

Ti = “Sachin Tendulkar scored 200 runs”

TCTR = {Sachin, scored, 200}

O = { C = {SportsPerson, Number}, R = {scoredRuns}}


I = {Sachin_Tendulkar, 200}

F = {{Sachin, Sachin_Tendulkar }, {scored, scoredRuns}, {200, 200}}

h (Sachin_Tendulkar ) = {SportsPerson}; h(200) = {Number}
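The illustrated validation can be traced with a short Python sketch. It is not the SEMANTIXS validation module. It assumes F can be treated as a dictionary (the slides define it as a set of pairs), takes Domain and Range of scoredRuns from the earlier ontology example, and uses the hypothetical names `DOMAIN`, `RANGE` and `validate`.

```python
# Domain ontology with instances, copied from the illustration.
C = {"SportsPerson", "Number"}
R = {"scoredRuns"}
I = {"Sachin_Tendulkar", "200"}
h = {"Sachin_Tendulkar": {"SportsPerson"}, "200": {"Number"}}

# Mapping F from words to relations or instances (simplified to a dict).
F = {"Sachin": "Sachin_Tendulkar", "scored": "scoredRuns", "200": "200"}

# Assumption: domain and range as given in the earlier ontology example.
DOMAIN = {"scoredRuns": {"SportsPerson"}}
RANGE = {"scoredRuns": {"Number"}}


def validate(candidates):
    """Return the set K of constructs that pass the validation rule."""
    K = set()
    for w1, w3, w2 in candidates:  # (subject word, predicate word, object word)
        y1, r, y2 = F.get(w1), F.get(w3), F.get(w2)
        # Instance match for subject and object, relationship match for predicate.
        if y1 not in I or y2 not in I or r not in R:
            continue
        # Domain / range check: some concept of y1 (y2) lies in Domain(r) (Range(r)).
        if h[y1] & DOMAIN[r] and h[y2] & RANGE[r]:
            K.add((y1, r, y2))
    return K


print(validate([("Sachin", "scored", "200")]))
```

A candidate whose words have no image under F, or whose concepts fail the domain or range check, is simply dropped from K.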

44/63





$\{Sachin\_Tendulkar, scoredRuns, 200\} \in K$ holds

[Figure: validation process with respect to the domain ontology with instances]

47/63



Approach Representation Framework

Simple Relationships - Primitive Transformation

Extracted information is $K_{simple} = \{\{s_i, p_i, o_i\} \mid \{s_i, p_i, o_i\} \in K \land |o_i| = 1\}$

$TransformPrimitive(\{s_i, p_i, o_i\}) \to G_{RDF}(\{s_i, o_i\}, \{p_i\})$

The transformation is easily realized using the RDF triple notation

Primitive Transformation – Example

{ Sachin_Tendulkar, scoredRuns, 200 }
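As a rough sketch of the primitive transformation, the example construct can be turned into a one-edge graph and printed in an N-Triples-like form. The function names, the base URI `http://example.org/` and the choice to write purely numeric objects as literals are assumptions made for illustration. The actual system relies on the Jena toolkit rather than on code like this.

```python
def transform_primitive(construct):
    """Map a simple construct {s, p, o} to a graph: nodes {s, o}, one edge labeled p."""
    s, p, o = construct
    return {s, o}, {(s, p, o)}


def to_ntriples(edges, base="http://example.org/"):
    """Serialize edges as N-Triples-style lines (hypothetical base URI)."""
    lines = []
    for s, p, o in sorted(edges):
        # Assumption: purely numeric objects are written as plain literals.
        obj = f'"{o}"' if o.isdigit() else f"<{base}{o}>"
        lines.append(f"<{base}{s}> <{base}{p}> {obj} .")
    return lines


nodes, edges = transform_primitive(("Sachin_Tendulkar", "scoredRuns", "200"))
for line in to_ntriples(edges):
    print(line)
```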

48/63




Complex Relationships - Composite Transformation

Extracted information is $K_{complex} = \{\{s_i, p_i, o_i\} \mid \{s_i, p_i, o_i\} \in K \land |o_i| > 1\}$

$TransformComposite(\{s_i, p_i, o_i\}) \to \{TransformPrimitive(\{s_i, p_i, o_{i1}\}), …\}$

Transformation realized using RDF reification mechanism

Composite Transformation – Example

{Microsoft ad, says, { Mac_Unit, cool, customers } }

49/63



Approach Representation Algorithm

RDF Graph Generation Algorithm

Composite

Transformation

Primitive

Transformation

50/63


Evaluation

SEMANTIXS Architecture

Experimental Results and Analysis


Outline

51/63



Evaluation SEMANTIXS Architecture

SEMANTIXS

System to extract information from free-text in the form of complex (and

simple) relationships - https://sourceforge.net/projects/semantixs

Java-based Web Application utilizing:

Jena Semantic Web Toolkit

Stanford Parser Libraries

Google Web Toolkit

SVG Visualizer from HP Lab

Operates in 3 different modes – Trade-off between correctness &

coverage

Output conforms to W3C guidelines for RDF – Implicit graph

specification

Visualization facility to analyze entity-specific RDF sub-graphs

52/63




Evaluation SEMANTIXS Architecture

SEMANTIXS

Module implementing

the Recursive

Validation and

Representation

Algorithm

Module Implementing

the Extraction Rules

and related Logic

Module implementing

the Core validation

Logic

53/63



Evaluation Results and Analysis

Experimental Setup

Pre-annotated benchmark data-set unavailable for complex relationships

Gold standard in IE – Message Understanding Conferences (MUC-1 to MUC-7)

Mostly news articles related to military and civil themes

Focused on tasks related to entities, facts, events and attributes

Not rich enough in complex nested relationships

Chosen real-world Text, Ontology and Instances:

Followed suit with MUC – Selected news articles from CBSNews

Queried CBSNews.com [1] for “Dow Jones”

Randomly selected 80 sentences across 4 articles

Utilized DBpedia ontology [2] and a subset of types

[1] Query: http://www.cbsnews.com/1770-5_162-0-4.html?query=Dow+Jones&searchtype=cbsSearch

[2] DBpedia: http://wiki.dbpedia.org/Downloads34#dbpediaontology

54/63







For complex relations (type 1 & 2), correctness judged based upon -

Correct structural representation extracted for complex relationship

Correct semantic representation extracted for the simple relationship within

Correct and complete extraction of all the relations contributes to each

individual count

Partially-correct extraction still contributes to the count for correctly

extracted relationship

Experimental Text: Counts of Pos and Neg Instances

Methodology used in Analyzing Correctness

55/63




Correctly Classified: Counts of Pos and Neg Instances

Experimental Text: Counts of Pos and Neg Instances

56/63




Simple relationships

False positives & negatives due to shallow syntactic comparisons in validation

For complex (types 1 & 2), correctness based on structure – Recall true measure

For Type 1 (Clause-level)

Most false negatives due to multi-level dependency structures and references

False positives – While validating outer subject and predicate [similar to simple]

For Type 2 (With Qualification)

Most false negatives while validating qualification and value [similar to simple]

Precision, Recall and F-measure
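For reference, the three measures can be computed from counts of true positives, false positives and false negatives as sketched below. The function name is hypothetical, the F-measure is assumed to be the balanced F1 score, and the counts in the usage line are made-up placeholders, not results from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and balanced F-measure from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder counts for illustration only.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
print(round(p, 2), round(r, 2), round(f, 2))
```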

57/63




Type 3 (Conjunctions)

Correctness based on the expected construction of left and right fragments

Analysis of individual fragments falls under one of the other relationship types

References

High recall – Due to naïve pronoun resolution methodology

Low precision – Aggressive pronoun resolution leading to many false positives

Other Failing Cases – Algorithm not designed to handle them

Co-references

Negations, Or-conjunctions, etc

Outliers – Relevant instance but unexpected pattern in the dependency graph

Precision, Recall and F-measure

58/63




Querying the Graph

Example Graph

Extracted RDF metadata forms a Semantic Graph

Can be queried using SPARQL to answer complex questions

Performed queries to answer questions for the entity “Dow Jones”

59/63




Example Questions

Example Query

SELECT ?s1

WHERE { ?s <type> <#Statement> .

        ?s <#subject> <dbpedia.org/page/Dow_Jones_Industrial_Average> .

        ?s1 ?p1 ?s . }

Look for all those subjects s1, which

have a statement s as their object

such that s talks about Dow Jones

Finding the subjects of assertions that were made about an entity

Who made any assertions about Dow Jones ?

Finding entities based on complex criteria

What are the entities that Dow Jones made qualified statements about ?

Finding entities based on relationship participation

Which entity appears in a fact with Dow Jones ?
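The logic of the example query ("who made any assertions about Dow Jones?") can be mimicked over an in-memory list of triples. This is a plain-Python sketch of the pattern matching, not a SPARQL engine. The toy triples, the short `rdf:` names and the source names `Source_A` and `Source_B` are invented for illustration and are not taken from the experimental data.

```python
RDF_TYPE, STATEMENT, SUBJECT = "rdf:type", "rdf:Statement", "rdf:subject"
DOW = "Dow_Jones_Industrial_Average"

# Hypothetical toy graph: two sources each assert one reified statement.
TRIPLES = [
    ("stmt1", RDF_TYPE, STATEMENT),
    ("stmt1", SUBJECT, DOW),
    ("Source_A", "says", "stmt1"),
    ("stmt2", RDF_TYPE, STATEMENT),
    ("stmt2", SUBJECT, "Some_Other_Entity"),
    ("Source_B", "says", "stmt2"),
]


def who_asserted_about(entity, triples):
    """Find subjects s1 that have, as object, a statement whose subject is entity."""
    statements = {s for s, p, o in triples if p == RDF_TYPE and o == STATEMENT}
    about = {s for s, p, o in triples
             if p == SUBJECT and o == entity and s in statements}
    return sorted({s1 for s1, p1, o in triples if o in about})


print(who_asserted_about(DOW, TRIPLES))
```

The three set comprehensions correspond to the three triple patterns in the query's WHERE clause, joined on the shared variable for the statement node.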

60/63


Conclusion

Summary

Further Work


Outline

61/63



Conclusion Summary

Summary

We described a modular ontology-based approach to information

extraction for a subset of nested complex relationships

We illustrated a semantic representation of the extracted relationships in

the form of query-able (RDF) graphs

We described the system details of SEMANTIXS, a system for ontology-

guided extraction and semantic representation of structured information

from unstructured text and reported results to validate the proposed

approach

62/63



Conclusion Further Work

Further Work

Enhancements to improve the precision and recall of the system

Deep comparisons in validation

Consider synonyms, external resources, etc

Enhance pronoun resolution, co-reference resolution, etc

Complex knowledge discovery and question answering over the

extracted semantic graphs

Opinion mining and recommendation systems by creating semantic

graphs consisting entirely of opinions / recommendations

Extending the rule-base to capture more relationships, handle

negations, Or-conjunctions, etc

Perform domain analysis and ontology building

63/63



Conclusion

Thank You !

Sushain Pandit




Conclusion

Backup Slides

65/63




Existing Approaches for Information Extraction

Rule-based Approaches

Laborious but transparent in capturing complex semantic criteria

Best performing systems invariably use hand-crafted rules

Often rely on domain-specific trigger words

Automatic pattern induction (statistical methods)

Co-occurrence – statistically significant associations

Require a lot of labeled text corpora (hard to acquire for

complex relations)

Cluster Analysis – similarity measure

Require computational cost for feature preparation

66/63




Derivative Structures from Text

Complete lack of semantics necessitates an intermediate representation

Linguistic Parsers used to generate data structures with respect to a

formal grammar

Popular parsing libraries

Natural Language Toolkit (NLTK)

Two-Stage Discriminative Parser by McDonald, et al

Stanford Parser

Stanford Parser chosen based on

Flexibility of representation

Accuracy in dependency analysis, parsing, tagging, chunking, etc.

Processing Speed

67/63




Terminology

$p_i$: $i$-th condition or premise for a rule (defined below).

$c_j$: $j$-th action or consequent for a rule, corresponding to a set $\{p_i\}$

$G(V, E)$: A dependency graph with vertex-set V and edge-set E

$G_S(V')$: Subgraph of G induced by the vertex-set $V'$

$D$: A set of labels denoting the typed dependency relations

$l: E \to D$: A function that associates labels to the edges in G

Extraction Rule

For a dependency graph G, we define an extraction rule as:

$r_k: \{p_i\} \to \{c_j\}$, meaning: if $\{p_i\}$ holds, perform $\{c_j\}$

68/63





Forming Extraction Rule

pred1 = {Node with two outgoing edges with labels “nsubj” and “ccomp”}

sub1 = {Node (node1) that is connected to pred1 by edge with label “nsubj”, Node

connected to node1 by an edge with label “nn” or “quantmod”}

Formalized Extraction Rule:

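$r_{RIC1}: \{\exists\, u, v, w \in V,\ \exists\, e_1(u, v), e_2(v, w) \in E \mid l(e_1) = \text{"nsubj"} \land l(e_2) \in \{\text{"ccomp"}, \text{"parataxis"}\}\} \to \{pred_1 = \{v\},\ sub_1 = \{u\}\}$

$r_{RIC2}: \{\exists\, u, v, w, t \in V,\ \exists\, e_1(u, v), e_2(v, w), e_3(u, t) \in E \mid l(e_1) = \text{"nsubj"} \land l(e_2) \in \{\text{"ccomp"}, \text{"parataxis"}\} \land l(e_3) \in \{\text{"nn"}, \text{"quantmod"}\}\} \to \{sub_1 = sub_1 \cup \{t\}\}$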
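The two Type 1 rules can be sketched over a labeled edge list. The sketch below is illustrative only. Edges are written as (governor, label, dependent), which is an assumed convention that may differ from the edge orientation used in the formal rules. The edge list for "Microsoft ad says: Macs are too cool for its customers" is hand-written rather than parser output, and `extract_outer_relationship` is a hypothetical name.

```python
CLAUSAL = {"ccomp", "parataxis"}
SUBJECT_MODS = {"nn", "quantmod"}

# Hand-written partial dependencies for the Type 1 example sentence.
EDGES = [
    ("says", "nsubj", "ad"),
    ("ad", "nn", "Microsoft"),
    ("says", "ccomp", "cool"),
    ("cool", "nsubj", "Macs"),
    ("cool", "cop", "are"),
    ("cool", "advmod", "too"),
]


def extract_outer_relationship(edges):
    """Find predicates that have both an nsubj edge and a clausal edge."""
    clause_governors = {g for g, label, _ in edges if label in CLAUSAL}
    results = []
    for gov, label, dep in edges:
        if label == "nsubj" and gov in clause_governors:
            # First rule: the predicate and the head of its subject.
            # Second rule: extend the subject with nn / quantmod dependents.
            sub1 = {dep} | {m for g, l, m in edges
                            if g == dep and l in SUBJECT_MODS}
            inner = sorted(d for g, l, d in edges
                           if g == gov and l in CLAUSAL)
            results.append({"pred1": gov, "sub1": sub1,
                            "inner_clause_heads": inner})
    return results


print(extract_outer_relationship(EDGES))
```

The inner clause head ("cool" here) is what the algorithm would then process recursively as one of the other relationship types.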

69/63





Forming Extraction Rule

pred1 = {Node with two outgoing edges with labels “nsubj” and “dobj”}

sub1 = {Node (node1) that is connected to pred1 by edge with label “nsubj”,

Node connected to node1 by an edge with label “nn” or “quantmod”}

obj1 = {Node (node2) that is connected to pred1 by edge with label “dobj”, Node

connected to node2 by an edge with label “nn” or “quantmod”}

qual1 = {Node with two edges labeled “prep” and “amod”}

val1 = {Node that is connected to qual1 by the edge with label “amod”}

70/63




Identifying Simple Relationship Type

Simple Relationships:

At most one subject and object each

No clause-level dependencies, conjunctions, or a clausal subject

Only noun-compound, or adjectival modifiers

In terms of Stanford dependencies, this implies:

At most one dependency of type nsubj

At most one dependency from the set {dobj, pobj}

No dependencies from the set {ccomp, xcomp, acomp, compl, conj, etc.}.

Only *mod = {amod, quantmod, nn} as modifiers
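The conditions above translate into a simple check over the dependency labels of a sentence. The sketch below is one possible reading. It only blocks the labels named explicitly on the slide (the "etc." is not expanded), it keeps `compl` as written there, and it assumes that any label ending in "mod" other than amod and quantmod disqualifies the sentence. The name `is_simple` is hypothetical.

```python
from collections import Counter

BLOCKED = {"ccomp", "xcomp", "acomp", "compl", "conj"}  # labels named on the slide
ALLOWED_MODS = {"amod", "quantmod", "nn"}


def is_simple(labels):
    """Return True if the dependency labels describe a simple relationship."""
    counts = Counter(labels)
    if counts["nsubj"] > 1:
        return False  # at most one subject
    if counts["dobj"] + counts["pobj"] > 1:
        return False  # at most one object
    if BLOCKED & set(counts):
        return False  # no clause-level dependencies or conjunctions
    # Assumption: every *mod label present must be one of the allowed modifiers.
    mods = {label for label in counts if label.endswith("mod")}
    return mods <= ALLOWED_MODS


print(is_simple(["nsubj", "aux", "dobj", "det"]))
print(is_simple(["nsubj", "ccomp", "nsubj", "dobj"]))
```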

71/63




Rules for Entity & Relation Extraction – Simple

Dependency Graph

Forming Extraction Rule:

pred1 = {Node with two outgoing edges with labels “nsubj” and “dobj”}

sub1 = {Node (node1) that is connected to pred1 by edge with label “nsubj”, Node

connected to node1 by an edge with label “nn” or “*mod”}

obj1 = {Node (node2) that is connected to pred1 by edge with label “dobj”, Node

connected to node2 by an edge with label “nn” or “*mod”}

72/63




Rule-based Entity and Relationship Extraction Algorithm

Handle Clausal

Relationships

recursively

Handle Conjunctions

by Analyzing the

Structure of the

Sentence Parse

Apply Extraction Rules on

the Input Dependency

Graph

Store Information

Constructs for Pronoun

Resolution

73/63




Rule-based Entity and Relationship Extraction Algorithm

Handle

Qualified

Relationships

using

Enrichments

Utilize Stored Information

Constructs for Forward

Reference Resolution

Handle Simple

Relationships

74/63



Approach Overall Algorithm

Overall Algorithm to Extract Information from Text

Extract All Candidate

Information Constructs

for the sentence

Validation and

Represent the

Extracted Information

Constructs

75/63



Approach Discussion

Claim 1: The resulting graphs from TransformPrimitive and

TransformComposite are valid RDF fragments [Follows from the definitions of

TransformPrimitive and TransformComposite]

Claim 2: There always exists a transformation from a valid (syntactically

and w.r.t domain definition) natural language sentence containing at least

one of the relationship types identified by us, to a graph formalism such that

the underlying information expressed in the relationship is captured in a

query-able form in the graph [Follows from the algorithms in Composite Extraction

and Semantic Validation frameworks and Claim 1]

Claims Based on the Described Frameworks

76/63




Transforming Validated Constructs into Graph(s)

Seek transformation from the set K={{si, pi, oi}} of validated constructs

to a (RDF) Graph, GRDF (V, E) such that

the transformation be able to represent all types of validated

constructs for all relationship types

the resulting graph(s) conform to valid RDF specification

transformation for complex relationship types be easily realized

using either simple triple notation, or RDF reification mechanism

77/63




Complex Relationships - Composite Transformation

Extracted information is $K_{complex} = \{\{s_i, p_i, o_i\} \mid \{s_i, p_i, o_i\} \in K \land |o_i| > 1\}$

$TransformComposite(\{s_i, p_i, o_i\}) \to$

$TransformPrimitive(\{s_i, p_i, t\})$, $TransformPrimitive(\{t, obj, o_{o_i}\})$

$TransformPrimitive(\{t, pred, p_{o_i}\})$, $TransformPrimitive(\{t, sub, s_{o_i}\})$

$TransformPrimitive(\{t, stmt, id\})$
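The composite transformation can be illustrated with the standard RDF reification vocabulary. The sketch below is an interpretation, not the SEMANTIXS implementation. It maps sub, pred and obj to `rdf:subject`, `rdf:predicate` and `rdf:object`, and it assumes that the final `{t, stmt, id}` step marks the node t as an `rdf:Statement`. The function name and the blank-node label `_:stmt1` are hypothetical.

```python
def transform_composite(construct, node_id="_:stmt1"):
    """Expand {s, p, {so, po, oo}} into primitive triples via reification.

    node_id names the statement node t that stands for the inner relationship.
    """
    s, p, inner = construct
    so, po, oo = inner
    t = node_id
    return [
        (s, p, t),                          # outer relationship points at t
        (t, "rdf:type", "rdf:Statement"),   # assumed reading of {t, stmt, id}
        (t, "rdf:subject", so),
        (t, "rdf:predicate", po),
        (t, "rdf:object", oo),
    ]


example = ("Microsoft_ad", "says", ("Mac_Unit", "cool", "customers"))
for triple in transform_composite(example):
    print(triple)
```

Because the inner relationship becomes a node of its own, the outer subject can point at it with an ordinary triple, which is what makes assertions about statements queryable.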

78/63




Complex Type 2 – Composite Transformation

Relationships with qualifications represented as a set of Primitive

Transformations

Inputs for the Primitive Transformations created by Enrichments module

79/63




Correctly Classified: Counts of Pos and Neg Instances

Confusion Matrices

80/63
