Iowa State University Department of Computer Science
Copyright © Sushain Pandit, 2010.
Sushain Pandit
Department of Computer Science
Iowa State University
Ames, Iowa, USA
M.S. Thesis Defense
April 29, 2010
Ontology-guided Extraction of Structured Information from
Unstructured Text:
Identifying and Capturing Complex Relationships
Outline
Introduction
  Motivation
  Contributions
Problem
  Related Concepts
  Definition
Approach
  Composite Extraction Framework
  Semantic Validation Framework
  Representation Framework
Evaluation
  SEMANTIXS Architecture
  Experimental Results and Analysis
Conclusion
  Summary
  Further Work
Introduction
Information Extraction from Text
Process of extracting interesting information from unstructured text
Entities – Persons, Organizations, Locations, etc
Attributes – Name, Descriptors, Categories, etc
Events – Company established in 2010
Relationships – Person works for Organization
Co-references – IBM and International Business Machines
….
Our Focus – A subset of complex nested relationships
Motivating Example - A Moderately Complex Sentence
Sports News predicts that Sachin Tendulkar may score a double-hundred with high probability
and retire in 2015
Motivating Example – Usual Information Extraction Scene
Sports News predicts that Sachin Tendulkar may score a double-hundred with high probability
and retire in 2015
[Figure: the sentence annotated with the labels Entity, Entities, Event, Relationship and Attribute]
Motivating Example – Nested Relationships
“Sports News predicts that”: outer relationship, with a clause-level dependency on the inner clause
“Sachin Tendulkar may score a double-hundred”: inner clause, subject to the qualifying modifier
“with high probability”: a qualifying modifier
“and retire in 2015”: conjunction creating dependencies between parts of the sentence; the left part governs the meaning of the right part
Motivating Example – Domain Description
Sports News predicts that Sachin Tendulkar may score a double-hundred with high probability and retire in 2015
Sachin Tendulkar: an internationally recognized sportsperson, or someone else by the same name?
Motivating Example – Domain Description
Sports News predicts that Sachin Tendulkar may score a double-hundred with high probability and retire in 2015
Domain description in the form of a domain ontology: Sachin_Tendulkar type SportsPerson
Sachin Tendulkar: an internationally recognized sportsperson
Motivating Example – Representation
[Figure: graph over the nodes Sports News, Sachin_Tendulkar (twice), double-hundred, high probability and 2015, with edges labeled “predicts that” (twice), “scores” and “retire”]
Some semantic graph formalism that can capture the structured information
Existing Approaches for Information Extraction
Rule-based Approaches
Laborious but transparent in capturing complex semantic criteria
Best performing systems invariably use hand-crafted rules
Often rely on domain-specific trigger words
Automatic pattern induction (statistical methods)
Co-occurrence – Requires a lot of labeled text corpora
Cluster Analysis – Requires computational cost for feature preparation
Comprehensive surveys
N. Bach and S. Badaskar, 2007; G. Neumann and F. Xu, 2004
Recall: Our Focus – A subset of complex nested relationships
Our Approach – Domain independent rule formulation
Contributions
A modular ontology-based approach for extracting a subset of nested complex relationships, which decouples domain-specific knowledge from the rules used for information extraction
A framework to semantically represent the extracted relationships in the form of query-able RDF graphs
An open-source implementation of SEMANTIXS, a system for ontology-guided extraction of structured information from text
Results of experiments that validate the proposed approach
Problem
Parse Trees
Ordered, rooted trees representing the syntactic structure of a sentence
Penn Treebank notation¹ for tagging the sentence
S: Simple declarative clause
NP: Noun phrase; categorizes all constituents depending on a head noun
VP: Verb phrase; categorizes all constituents headed by a verb
¹ See ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/notation.tex for a complete list
Example
(S (NP Heart Attack) (VP Causes (NP reduced average lifespan)))
Dependency Graphs
Structures capturing implicit dependencies (sentential semantics)
between the tokens of a sentence
Stanford dependency notation² for labeling the graph
² See the handout, or http://nlp.stanford.edu/software/dependencies_manual.pdf for a complete list
Example [dependency graph figure not reproduced]
Formal Specifications for Validation and Representation
Domains generally described by concepts, relationships and instances
Need for a formalism to capture the domain description
Need for a suitable representation mechanism
Ontology
An ontology is a structure O=(R,C) such that:
The sets R and C are disjoint and their elements are called relations and concepts respectively
The elements in R induce a strict partial order on the elements in C
Example
O = {{SportsPerson, Person, Number}, {scoredRuns}}
Domain (scoredRuns) = {SportsPerson}, Range (scoredRuns) = Number
Domain Ontology with Instances
A domain ontology with instances is a structure DOI=(O,I,h) such that:
I is a set, whose elements are called instances
There exists a function h: I → P(C), where P(C) is the power set of the set of concepts C of the ontology O
Example
O = {{SportsPerson, Person, Number}, {scoredRuns}}
Domain (scoredRuns) = {SportsPerson}, Range (scoredRuns) = Number
I = {John, Steve, Sachin_Tendulkar}
h (Sachin_Tendulkar ) = {SportsPerson, Person}
h (John) = {Person}
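The two definitions above can be mirrored in a few lines of Python. This is only a sketch: the class names `Ontology` and `DomainOntologyWithInstances`, the field names, and the choice to return an empty set for instances without a listed type (such as Steve) are illustrative assumptions, not names or behavior taken from SEMANTIXS. The partial order on concepts is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """O = (R, C): disjoint sets of relations and concepts."""
    concepts: set
    relations: set
    domain: dict = field(default_factory=dict)  # relation -> set of concepts
    range_: dict = field(default_factory=dict)  # relation -> set of concepts

    def __post_init__(self):
        # The definition requires R and C to be disjoint.
        assert not (self.concepts & self.relations)


@dataclass
class DomainOntologyWithInstances:
    """DOI = (O, I, h) with h: I -> P(C)."""
    ontology: Ontology
    instances: set
    h: dict  # instance -> set of concepts, i.e. an element of P(C)

    def types_of(self, instance):
        # Assumption: an instance with no listed type maps to the empty set.
        return self.h.get(instance, set())


onto = Ontology(
    concepts={"SportsPerson", "Person", "Number"},
    relations={"scoredRuns"},
    domain={"scoredRuns": {"SportsPerson"}},
    range_={"scoredRuns": {"Number"}},
)
doi = DomainOntologyWithInstances(
    ontology=onto,
    instances={"John", "Steve", "Sachin_Tendulkar"},
    h={"Sachin_Tendulkar": {"SportsPerson", "Person"}, "John": {"Person"}},
)
print(doi.types_of("Sachin_Tendulkar"))
```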
Resource Description Framework (RDF)
Resources described using properties and values in RDF statements
Statements represented as RDF triples, each consisting of a subject, predicate and object
Uniform Resource Identifiers (URIs) identify resources
RDF Reification – Special mechanism to make assertions about statements instead of entities
Example – RDF Triple
Sachin Tendulkar scored 200 runs
Ontology-guided Structured Information Extraction
Given:
Text fragment consisting of sentences {Ti}
Domain ontology with instances DOI=(O,I,h)
Ontology-guided structured information extraction:
Determines a set TCTR of candidate information constructs
using entity and relationship extraction algorithm(s)
Validates TCTR with respect to DOI and finds a set K of
validated information constructs
Represents the constructs in K using a suitable mechanism
Remainder of the presentation – Details of the above three steps
Approach
Composite Extraction Framework
Terminology - Extraction Rule
Rule –
“Label(s) from {nn, amod} occur along the edges connected to an nsubj node” → “Group the associated nodes with the nsubj node”
“Labels nsubj & dobj occur along a set of adjacent edges” → “Extract the nodes associated with those edges as information constructs”
Result – Extraction of {{Heart, attack, causes}, reduced, {average, lifespan}} as a candidate information construct
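A minimal sketch of how the two rules above could be applied to a dependency graph stored as an edge list. The edge list is hand-written to match the result on the slide (it is not parser output), the function names are hypothetical, and applying the grouping step to the dobj node as well as the nsubj node is an assumption inferred from the stated result.

```python
# Typed dependencies as (governor, label, dependent) triples.
# Assumption: hand-written edges chosen to reproduce the slide's result.
edges = [
    ("reduced", "nsubj", "causes"),
    ("causes", "nn", "Heart"),
    ("causes", "nn", "attack"),
    ("reduced", "dobj", "lifespan"),
    ("lifespan", "amod", "average"),
]

GROUPING_LABELS = {"nn", "amod"}


def group(node, edges):
    """Rule 1: collect nn/amod dependents together with their head node."""
    members = [dep for gov, label, dep in edges
               if gov == node and label in GROUPING_LABELS]
    return members + [node]


def extract(edges):
    """Rule 2: nsubj and dobj edges sharing a governor yield a construct."""
    constructs = []
    for gov, label, subj in edges:
        if label != "nsubj":
            continue
        for gov2, label2, obj in edges:
            if gov2 == gov and label2 == "dobj":
                constructs.append((group(subj, edges), gov, group(obj, edges)))
    return constructs


print(extract(edges))
```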
Motivating Example: Identifying Sub-problems
“Sports News predicts that”: outer relationship, with a clause-level dependency on the inner clause
“Sachin Tendulkar may score a double-hundred”: inner clause, subject to the qualifying modifier
Identifying Complex Relationship - Type 1
Relationships with Internal Clauses:
Variants:
That Macs are too cool for its customers, says Microsoft ad
Microsoft ad says: Macs are too cool for its customers
Rules for Entity & Relation Extraction – Type 1
Dependency Graph:
Expected Extraction Rule Behavior:
Clausal Complement – ccomp
Variants - parataxis
Leave this for later – Recursively
reduced to one of the other types
Motivating Example: Identifying Sub-problems
Sachin Tendulkar may score a double-hundred with high probability
“with high probability”: a qualifying modifier
Identifying Complex Relationship - Type 2
Relationships with Qualifiers:
Variants:
With high probability, Sachin Tendulkar may score a double-hundred.
There is a high probability that Sachin Tendulkar may score a double-hundred
Rules for Entity & Relation Extraction – Type 2
Dependency Graph:
Expected Extraction Rule Behavior:
Prepositional Modifier – prep
Variants – prep_xxx
Adjectival Modifier – amod
Prep – amod Pattern Identifies
this Relationship Type
Motivating Example: Identifying Sub-problems
“Sachin Tendulkar may score a double-hundred” and “retire in 2015”
Conjunction creating dependencies between parts of the sentence
Left part governing the meaning of the right part
Identifying Complex Relationship – Type 3
Relationships with Conjunctions:
Rules for Entity & Relation Extraction – Type 3
Conjunctions connect parts having immediate dependencies
Reference resolution required between the parts
Utilize Sentence Parses instead of dependency graphs
Formulation:
If right-part contains Simple Declarative Clause (S), process as a
distinct sentence
If right-part contains Verb and Noun Phrases (VP, NP), use the subject
of left-part and process as a distinct sentence
If right-part contains only NP, use the subject and object of left-part
and process as a distinct sentence
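The three-way formulation above can be sketched as a small decision function. The function name `borrowed_from_left`, the representation of the right part as a set of Penn Treebank labels, and the reading of the second rule as "a VP is present" are assumptions for illustration. The slide does not say how the borrowed subject and object are combined with an NP-only right part, so the sketch only reports which roles would be borrowed.

```python
def borrowed_from_left(right_labels):
    """Return the left-part roles the right part needs, per the three rules.

    right_labels: set of Penn Treebank phrase labels found in the right part.
    """
    if "S" in right_labels:
        return []                       # full clause: process as is
    if "VP" in right_labels:
        return ["subject"]              # verb phrase: borrow the subject
    if right_labels == {"NP"}:
        return ["subject", "object"]    # bare noun phrase
    return None                         # pattern not covered by the rules


# Motivating example: the right part "retire in 2015" is a VP with no subject.
left = {"subject": "Sachin Tendulkar", "object": "double-hundred"}
roles = borrowed_from_left({"VP", "PP", "NP"})
sentence = " ".join([left[role] for role in roles] + ["retire in 2015"])
print(sentence)
```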
Motivating Example: Identifying Sub-problems
Sachin Tendulkar may score a double-hundred
Rules for Entity & Relation Extraction – Simple
Dependency Graph
Extraction Algorithm - Illustration
Input: Sports News predicts that Sachin Tendulkar may score a double-hundred with high probability and retire in 2015
Outer clause: {Sports News, predicts}
Inner clause, left part of the conjunction: Sachin Tendulkar may score a double-hundred with high probability
Inner clause, right part of the conjunction: retire in 2015
Append – Sachin Tendulkar: Sachin Tendulkar retire in 2015
Extracted constructs:
{Sports News, predicts, {Sachin Tendulkar, scored, double hundred, probability, high}}
{Sports News, predicts, {Sachin Tendulkar, retire, 2015}}
Semantic Validation Framework
Validation using Domain Ontology
Extracted information constructs are matched against the domain description
Instance matches for the subject and object
Relationship match for the predicate
Domain / Range check to ensure validity as per the domain
Validation Rule
Given
Set of sentences {Ti} with word-set W
Set TCTR of candidate constructs extracted by an extraction algorithm
Domain ontology with instances DOI=(O,I,h)
Mapping F from W to R ∪ I
Validation process results in a set K of validated constructs such that:
{∃ y1, y2 ∈ I, ∃ r ∈ R | {y1, r, y2} ∈ K
⇔ (∃ w1, w2, w3 ∈ W, c1, c2 ∈ C | {w1, w3, w2} ∈ TCTR ∧ {w1, y1} ∈ F ∧ {w2, y2} ∈ F ∧ {w3, r} ∈ F ∧ c1 ∈ h(y1) ∧ c2 ∈ h(y2) ∧ c1 ∈ Domain(r) ∧ c2 ∈ Range(r))}
Validation Rule - Illustrated
Ti = “Sachin Tendulkar scored 200 runs”
TCTR = {Sachin, scored, 200}
O = { C = {SportsPerson, Number}, R = {scoredRuns}}
Domain (scoredRuns) = {SportsPerson}, Range (scoredRuns) = Number
I = {Sachin_Tendulkar, 200}
F = {{Sachin, Sachin_Tendulkar }, {scored, scoredRuns}, {200, 200}}
h (Sachin_Tendulkar ) = {SportsPerson}; h(200) = {Number}
{Sachin_Tendulkar, scoredRuns, 200} ∈ K holds
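The validation rule can be read as a filter over the candidate constructs. The following is a sketch under assumptions: the function name `validate` and the use of dictionaries for F, h, Domain and Range are illustrative choices, not the SEMANTIXS implementation. The data is the example from the slide.

```python
def validate(candidates, F, h, domain, range_):
    """Return the set K of validated (subject, relation, object) triples."""
    K = set()
    for w1, w3, w2 in candidates:            # {w1, w3, w2} in TCTR
        if w1 not in F or w2 not in F or w3 not in F:
            continue                         # a word has no mapping in F
        y1, r, y2 = F[w1], F[w3], F[w2]
        # Some c1 in h(y1) must lie in Domain(r), some c2 in h(y2) in Range(r).
        subject_ok = h.get(y1, set()) & domain.get(r, set())
        object_ok = h.get(y2, set()) & range_.get(r, set())
        if subject_ok and object_ok:
            K.add((y1, r, y2))
    return K


TCTR = [("Sachin", "scored", "200")]
F = {"Sachin": "Sachin_Tendulkar", "scored": "scoredRuns", "200": "200"}
h = {"Sachin_Tendulkar": {"SportsPerson"}, "200": {"Number"}}
domain = {"scoredRuns": {"SportsPerson"}}
range_ = {"scoredRuns": {"Number"}}

K = validate(TCTR, F, h, domain, range_)
print(K)
```

If h(Sachin_Tendulkar) did not contain SportsPerson, the domain check would fail and the construct would be left out of K.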
Simple Relationships - Primitive Transformation
Extracted Information is - Ksimple = {{si, pi, oi} | {si, pi, oi} ∈ K ∧ |oi| = 1}
TransformPrimitive({si, pi, oi}) → GRDF({si, oi}, {pi})
The transformation is easily realized using the RDF triple notation
Primitive Transformation – Example
{ Sachin_Tendulkar, scoredRuns, 200 }
Complex Relationships - Composite Transformation
Extracted Information is - Kcomplex = {{si, pi, oi} | {si, pi, oi} ∈ K ∧ |oi| > 1}
TransformComposite({si, pi, oi}) → {TransformPrimitive({si, pi, oi1}), …}
Transformation realized using RDF reification mechanism
Composite Transformation – Example
{Microsoft ad, says, { Mac_Unit, cool, customers } }
RDF Graph Generation Algorithm
[Algorithm listing not reproduced; its annotated parts are the Composite Transformation and the Primitive Transformation]
Evaluation
SEMANTIXS
System to extract information from free-text in the form of complex (and
simple) relationships - https://sourceforge.net/projects/semantixs
Java-based Web Application utilizing:
Jena Semantic Web Toolkit
Stanford Parser Libraries
Google Web Toolkit
SVG Visualizer from HP Labs
Operates in 3 different modes – Trade-off between correctness &
coverage
Output conforms to W3C guidelines for RDF – Implicit graph
specification
Visualization facility to analyze entity-specific RDF sub-graphs
SEMANTIXS architecture diagram (not reproduced), with annotated modules:
Module implementing the recursive validation and representation algorithm
Module implementing the extraction rules and related logic
Module implementing the core validation logic
Experimental Setup
No pre-annotated benchmark data set is available for complex relationships
Gold standard in IE – Message Understanding Conferences (MUC-1 to MUC-7)
Mostly news articles related to military and civil themes
Focused on tasks related to entities, facts, events and attributes
Not rich enough in complex nested relationships
Chose real-world text, ontology and instances:
Followed suit with MUC – Selected news articles from CBSNews
Queried CBSNews.com¹ for “Dow Jones”
Randomly selected 80 sentences across 4 articles
Utilized the DBpedia ontology² and a subset of its types
¹ Query - http://www.cbsnews.com/1770-5_162-0-4.html?query=Dow+Jones&searchtype=cbsSearch
² DBpedia - http://wiki.dbpedia.org/Downloads34#dbpediaontology
Methodology used in Analyzing Correctness
For complex relations (types 1 & 2), correctness is judged based upon:
Correct structural representation extracted for the complex relationship
Correct semantic representation extracted for the simple relationship within
Correct and complete extraction of all the relations contributes to each individual count
Partially correct extraction still contributes to the count for the correctly extracted relationship
Experimental Text: Counts of Pos and Neg Instances [table not reproduced]
Correctly Classified: Counts of Pos and Neg Instances [table not reproduced]
Simple relationships
False positives & negatives due to shallow syntactic comparisons in validation
For complex relationships (types 1 & 2), correctness is based on structure, so recall is the true measure
For Type 1 (Clause-level)
Most false negatives due to multi-level dependency structures and references
False positives – While validating outer subject and predicate [similar to simple]
For Type 2 (With Qualification)
Most false negatives while validating qualification and value [similar to simple]
Precision, Recall and F-measure
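For reference, the three measures named above are conventionally computed from counts of true positives, false positives and false negatives. The sketch below assumes the balanced F-measure (F1); the slide does not say which weighting was used. The counts are hypothetical and chosen only to show the arithmetic; they are not results from the thesis.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions; returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only; not taken from the experiments.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
print(round(p, 3), round(r, 3), round(f, 3))
```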
Type 3 (Conjunctions)
Correctness based on the expected construction of left and right fragments
Analysis of individual fragments falls under one of the other relationship types
References
High recall – Due to naïve pronoun resolution methodology
Low precision – Aggressive pronoun resolution leading to many false positives
Other Failing Cases – Algorithm not designed to handle them
Co-references
Negations, Or-conjunctions, etc
Outliers – Relevant instance but unexpected pattern in the dependency graph
Precision, Recall and F-measure
Querying the Graph
Extracted RDF metadata forms a Semantic Graph
Can be queried using SPARQL to answer complex questions
Performed queries to answer questions for the entity “Dow Jones”
Example Graph [figure not reproduced]
Example Questions
Finding the subjects of assertions that were made about an entity
Who made any assertions about Dow Jones?
Finding entities based on complex criteria
What are the entities that Dow Jones made qualified statements about?
Finding entities based on relationship participation
Which entity appears in a fact with Dow Jones?
Example Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?s1
WHERE {
  ?s rdf:type rdf:Statement .
  ?s rdf:subject <http://dbpedia.org/page/Dow_Jones_Industrial_Average> .
  ?s1 ?p1 ?s .
}
Look for all those subjects s1 that have a statement s as their object, such that s talks about Dow Jones
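To make the first question ("Who made any assertions about Dow Jones?") concrete without a triple store, the same pattern can be evaluated over a set of plain tuples. This is a sketch only: the graph below is hypothetical, the `ex:` and `dbpedia:` names are placeholders, and the function `who_asserted_about` is not part of SEMANTIXS.

```python
RDF_TYPE = "rdf:type"
RDF_STATEMENT = "rdf:Statement"
RDF_SUBJECT = "rdf:subject"
DOW = "dbpedia:Dow_Jones_Industrial_Average"

# Hypothetical reified graph: an analyst asserts a statement about Dow Jones.
triples = {
    ("stmt1", RDF_TYPE, RDF_STATEMENT),
    ("stmt1", RDF_SUBJECT, DOW),
    ("stmt1", "rdf:predicate", "ex:gained"),
    ("stmt1", "rdf:object", "ex:Points"),
    ("ex:Analyst", "ex:says", "stmt1"),
    ("ex:Analyst", RDF_TYPE, "ex:Person"),
}


def who_asserted_about(entity, triples):
    """Subjects s1 that have, as object, a reified statement about entity."""
    statements = {s for s, p, o in triples
                  if p == RDF_SUBJECT and o == entity
                  and (s, RDF_TYPE, RDF_STATEMENT) in triples}
    return {s1 for s1, _p1, o in triples if o in statements}


print(who_asserted_about(DOW, triples))
```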
Conclusion
Summary
We described a modular ontology-based approach to information
extraction for a subset of nested complex relationships
We illustrated a semantic representation of the extracted relationships in
the form of query-able (RDF) graphs
We described the system details of SEMANTIXS, a system for ontology-
guided extraction and semantic representation of structured information
from unstructured text and reported results to validate the proposed
approach
Further Work
Enhancements to improve the precision and recall of the system
Deep comparisons in validation
Consider synonyms, external resources, etc
Enhance pronoun resolution, co-reference resolution, etc
Complex knowledge discovery and question answering over the
extracted semantic graphs
Opinion mining and recommendation systems by creating semantic
graphs consisting entirely of opinions / recommendations
Extending the rule-base to capture more relationships, handle
negations, Or-conjunctions, etc
Perform domain analysis and ontology building
Thank You !
Sushain Pandit
Backup Slides
Existing Approaches for Information Extraction
Rule-based Approaches
Laborious but transparent in capturing complex semantic criteria
Best performing systems invariably use hand-crafted rules
Often rely on domain-specific trigger words
Automatic pattern induction (statistical methods)
Co-occurrence – statistically significant associations
Require a lot of labeled text corpora (hard to acquire for
complex relations)
Cluster Analysis – similarity measure
Require computational cost for feature preparation
Derivative Structures from Text
Complete lack of semantics necessitates an intermediate representation
Linguistic Parsers used to generate data structures with respect to a
formal grammar
Popular parsing libraries
Natural Language Toolkit (NLTK)
Two-Stage Discriminative Parser by McDonald, et al
Stanford Parser
Stanford Parser chosen based on
Flexibility of representation
Accuracy in dependency analysis, parsing, tagging, chunking, etc.
Processing Speed
Terminology
pi: ith condition or premise for a rule (defined below).
cj: jth action or consequent for a rule, corresponding to a set {pi}
G(V,E): A dependency graph with vertex-set V and edge-set E
GS(V′): Subgraph of G induced by the vertex-set V′
D: A set of labels denoting the typed dependency relations
l: E → D: A function that associates labels to the edges in G
Extraction Rule
For a dependency graph G, we define an extraction rule as:
rk: {pi} → {cj}, meaning: if {pi} holds, perform {cj}
Rules for Entity & Relation Extraction – Type 1
Forming Extraction Rule
pred1 = {Node with two outgoing edges with labels “nsubj” and “ccomp”}
sub1 = {Node (node1) that is connected to pred1 by edge with label “nsubj”, Node
connected to node1 by an edge with label “nn” or “quantmod”}
Formalized Extraction Rule:
rRIC1: {∃ u, v, w ∈ V, ∃ e1(u, v), e2(v, w) ∈ E | l(e1) = “nsubj” ∧ l(e2) ∈ {“ccomp”, “parataxis”}} → {pred1 = {v}, sub1 = {u}}
rRIC2: {∃ u, v, w, t ∈ V, ∃ e1(u, v), e2(v, w), e3(u, t) ∈ E | l(e1) = “nsubj” ∧ l(e2) ∈ {“ccomp”, “parataxis”} ∧ l(e3) ∈ {“nn”, “quantmod”}} → {sub1 = sub1 ∪ {t}}
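The two Type 1 rules can be sketched over an edge list. Assumptions in this sketch: edges are stored in the Stanford (governor, label, dependent) order, so the predicate v is the governor of the nsubj edge; the edge list for the Microsoft ad sentence is hand-written rather than parser output; and the `inner_heads` field, which records where recursion into the inner clause would start, is an addition for illustration.

```python
CLAUSAL = {"ccomp", "parataxis"}
NAME_PARTS = {"nn", "quantmod"}


def apply_ric_rules(edges):
    """edges: (governor, label, dependent) triples of one dependency graph."""
    results = []
    for v, label, u in edges:
        if label != "nsubj":
            continue
        # rRIC1: v has an nsubj edge to u and a ccomp/parataxis edge to some w.
        inner = [w for g, l, w in edges if g == v and l in CLAUSAL]
        if not inner:
            continue
        # rRIC2: extend sub1 with nodes t attached to u via nn/quantmod.
        sub1 = {u} | {t for g, l, t in edges if g == u and l in NAME_PARTS}
        results.append({"pred1": v, "sub1": sub1, "inner_heads": inner})
    return results


# Hand-written edges for "Microsoft ad says: Macs are too cool ..." (assumed).
edges = [
    ("says", "nsubj", "ad"),
    ("ad", "nn", "Microsoft"),
    ("says", "ccomp", "cool"),
    ("cool", "nsubj", "Macs"),
]
matches = apply_ric_rules(edges)
print(matches)
```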
Rules for Entity & Relation Extraction – Type 2
Forming Extraction Rule
pred1 = {Node with two outgoing edges with labels “nsubj” and “dobj”}
sub1 = {Node (node1) that is connected to pred1 by edge with label “nsubj”,
Node connected to node1 by an edge with label “nn” or “quantmod”}
obj1 = {Node (node2) that is connected to pred1 by edge with label “dobj”, Node
connected to node2 by an edge with label “nn” or “quantmod”}
qual1 = {Node with two edges labeled “prep” and “amod”}
val1 = {Node that is connected to qual1 by the edge with label “amod”}
Identifying Simple Relationship Type
Simple Relationships:
At most one subject and object each
No clause-level dependencies, conjunctions, or a clausal subject
Only noun-compound, or adjectival modifiers
In terms of Stanford dependencies, this implies:
At most one dependency of type nsubj
At most one dependency from the set {dobj, pobj}
No dependencies from the set {ccomp, xcomp, acomp, compl, conj, etc.}.
Only *mod = {amod, quantmod, nn} as modifiers
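The four conditions above translate into a short check over the dependency labels of a sentence. This is a sketch under assumptions: the function name `is_simple` is hypothetical, the excluded set contains only the labels named on the slide (the "etc." is not expanded), and "modifier" is taken to mean any label ending in "mod".

```python
from collections import Counter

CLAUSE_OR_CONJ = {"ccomp", "xcomp", "acomp", "compl", "conj"}
ALLOWED_MODIFIERS = {"amod", "quantmod", "nn"}


def is_simple(labels):
    """labels: typed-dependency labels on the edges of one sentence."""
    counts = Counter(labels)
    if counts["nsubj"] > 1:
        return False                  # at most one subject
    if counts["dobj"] + counts["pobj"] > 1:
        return False                  # at most one object
    if CLAUSE_OR_CONJ & set(labels):
        return False                  # no clause-level or conjunction edges
    # Assumption: any label ending in "mod" counts as a modifier.
    modifiers = {label for label in labels if label.endswith("mod")}
    return modifiers <= ALLOWED_MODIFIERS


print(is_simple(["nn", "nsubj", "amod", "dobj"]))
print(is_simple(["nsubj", "ccomp", "nsubj", "dobj"]))
```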
Rules for Entity & Relation Extraction – Simple
Dependency Graph
Forming Extraction Rule:
pred1 = {Node with two outgoing edges with labels “nsubj” and “dobj”}
sub1 = {Node (node1) that is connected to pred1 by edge with label “nsubj”, Node
connected to node1 by an edge with label “nn” or “*mod”}
obj1 = {Node (node2) that is connected to pred1 by edge with label “dobj”, Node
connected to node2 by an edge with label “nn” or “*mod”}
Rule-based Entity and Relationship Extraction Algorithm
[Algorithm listing not reproduced; its annotated steps are:]
Handle clausal relationships recursively
Handle conjunctions by analyzing the structure of the sentence parse
Apply extraction rules on the input dependency graph
Store information constructs for pronoun resolution
Handle qualified relationships using enrichments
Utilize stored information constructs for forward reference resolution
Handle simple relationships
Overall Algorithm to Extract Information from Text
[Algorithm listing not reproduced; its annotated steps are:]
Extract all candidate information constructs for the sentence
Validate and represent the extracted information constructs
Claims Based on the Described Frameworks
Claim 1: The graphs resulting from TransformPrimitive and TransformComposite are valid RDF fragments [follows from the definitions of TransformPrimitive and TransformComposite]
Claim 2: There always exists a transformation from a valid (syntactically and w.r.t. the domain definition) natural language sentence containing at least one of the relationship types identified by us to a graph formalism, such that the underlying information expressed in the relationship is captured in a query-able form in the graph [follows from the algorithms in the Composite Extraction and Semantic Validation frameworks and Claim 1]
Transforming Validated Constructs into Graph(s)
Seek transformation from the set K={{si, pi, oi}} of validated constructs
to a (RDF) Graph, GRDF (V, E) such that
the transformation be able to represent all types of validated
constructs for all relationship types
the resulting graph(s) conform to valid RDF specification
transformation for complex relationship types be easily realized
using either simple triple notation, or RDF reification mechanism
Complex Relationships - Composite Transformation
Extracted Information is - Kcomplex = {{si, pi, oi} | {si, pi, oi} ∈ K ∧ |oi| > 1}
TransformComposite({si, pi, oi}) →
TransformPrimitive({si, pi, t}), TransformPrimitive({t, obj, ooi}),
TransformPrimitive({t, pred, poi}), TransformPrimitive({t, sub, soi}),
TransformPrimitive({t, stmt, id})
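A sketch of the two transformations using plain tuples in place of an RDF library. The function names mirror the slide, but the rest is assumed: the fresh node `t` is generated from a counter, and the slide's `sub`, `pred`, `obj` and `stmt` are mapped onto the standard RDF reification terms `rdf:subject`, `rdf:predicate`, `rdf:object` and `rdf:type rdf:Statement`, which is an interpretation rather than something the slide states.

```python
import itertools

_ids = itertools.count(1)


def transform_primitive(s, p, o):
    """A simple construct maps to a single RDF-style triple."""
    return [(s, p, o)]


def transform_composite(s, p, inner):
    """Reify the inner (subject, predicate, object); point the outer triple at it."""
    so, po, oo = inner
    t = f"_:stmt{next(_ids)}"  # fresh node standing for the inner statement
    return (
        transform_primitive(s, p, t)
        + transform_primitive(t, "rdf:type", "rdf:Statement")
        + transform_primitive(t, "rdf:subject", so)
        + transform_primitive(t, "rdf:predicate", po)
        + transform_primitive(t, "rdf:object", oo)
    )


graph = transform_composite("Microsoft_ad", "says",
                            ("Mac_Unit", "cool", "customers"))
for triple in graph:
    print(triple)
```

Because the outer triple's object is the statement node, a query can start from the reified statement and walk back to whoever asserted it.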
Complex Type 2 – Composite Transformation
Relationships with qualifications represented as a set of Primitive
Transformations
Inputs for the Primitive Transformations created by Enrichments module
Correctly Classified: Counts of Pos and Neg Instances
Confusion Matrices