CILC2011 A framework for structured knowledge extraction and representation from natural language via deep sentence analysis Stefania Costantini Niva Florio Alessio Paolucci Università Degli Studi Dell’Aquila



CILC2011

A framework for structured knowledge extraction and representation from natural language via deep sentence analysis

Stefania Costantini, Niva Florio, Alessio Paolucci

Università degli Studi dell’Aquila


Outline

1. Motivation

2. Our Proposal

3. Workflow

4. Deep Analysis: Parsing & Dependency Structure

5. Context Disambiguation

6. Resolution

7. OOLOT

8. RDF/OWL Exporting

9. Example

10. Conclusion


Motivation

“Overcome the knowledge acquisition bottleneck”


Motivation

Structured data from plain text:

Semantic Web

Ontology Population

NLP2RDF

Structured Query

The most interesting one: ontology population (Semantic Web), but the possibilities are endless.


Our Proposal

Our framework allows us to:

Extract knowledge from natural language sentences using a deep analysis technique based on linguistic dependencies and phrase syntactic structure.

Use OOLOT (Ontology Oriented Language of Thought), an intermediate language based on ASP (Answer Set Programming), specifically designed to represent the distinctive features of the knowledge extracted from natural language.

Easily integrate our framework into the context of the Semantic Web.

OOLOT lets us exploit non-monotonic reasoning (through ASP) to deal with common-sense reasoning and other typical aspects of the knowledge encoded in natural language.


Workflow


Parsing

Syntactic Parsing:

It determines the syntactic structure of a sentence

Chomsky’s constituent analysis

It builds up the elements in their hierarchical order

Syntactic parsers decompose a text into tokens and assign each token its grammatical function

Statistical Parsing:

It is based on a corpus of annotated training data

It gathers information about the frequency with which the elements occur in specific contexts

Statistics alone may not be enough to determine when to split a symbol into sub-symbols

Probabilistic Context Free Grammar (PCFG):

More than one production rule may apply to a sequence of words, thus resulting in a conflict

It uses the frequency of the various productions to order them
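The PCFG idea of ordering competing productions by frequency can be sketched as follows. This is only an illustration: the rule counts are invented, and the relative-frequency estimate shown is the textbook one, not necessarily what the Stanford Parser does internally.

```python
from collections import Counter, defaultdict

# Hypothetical treebank counts: (left-hand side, right-hand side) -> frequency.
# The numbers are invented for illustration only.
rule_counts = Counter({
    ("VP", ("V", "NP")): 60,
    ("VP", ("V", "NP", "PP")): 25,
    ("VP", ("V",)): 15,
    ("NP", ("Det", "N")): 70,
    ("NP", ("N",)): 30,
})


def rule_probabilities(counts):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


def ranked_productions(probs, lhs):
    """Return the productions that expand `lhs`, most probable first."""
    candidates = [(rhs, p) for (left, rhs), p in probs.items() if left == lhs]
    return sorted(candidates, key=lambda item: item[1], reverse=True)


probs = rule_probabilities(rule_counts)
for rhs, p in ranked_productions(probs, "VP"):
    print("VP ->", " ".join(rhs), f"{p:.2f}")
```

When several rules could expand the same symbol, the parser can try them in this order, or multiply the probabilities along a derivation to compare whole parse trees.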


Parsing

Stanford Parser: PCFG parser


Parsing

Statistical parsing is useful for handling problems such as ambiguity and efficiency,

BUT

we lose part of the semantic information.

Dependency Grammar: words in a sentence are connected by means of binary, asymmetrical governor-dependent relationships
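A dependency structure of this kind can be held as a list of (relation, governor, dependent) triples. The sketch below is hand-written for the sentence "Many girls eat apples"; the relation labels are assumed to follow the Stanford typed-dependency naming, and the helper names are hypothetical.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    relation: str   # label of the grammatical relation
    governor: str   # head word
    dependent: str  # word that depends on the head


# Hand-written analysis of "Many girls eat apples" (assumed labels, not parser output).
deps = [
    Dependency("amod", "girls", "Many"),
    Dependency("nsubj", "eat", "girls"),
    Dependency("dobj", "eat", "apples"),
]


def dependents_of(word, dependencies):
    """Map each relation label to the dependent governed by `word`."""
    return {d.relation: d.dependent for d in dependencies if d.governor == word}


def roots(dependencies):
    """Words that govern something but are never governed themselves."""
    governors = {d.governor for d in dependencies}
    dependents = {d.dependent for d in dependencies}
    return sorted(governors - dependents)


print(roots(deps))                 # the main verb
print(dependents_of("eat", deps))  # its subject and object
```

Each relation is binary (two words) and asymmetrical (one governs the other), which is what makes the subject and object of a verb directly readable from the structure.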


Context Disambiguation

Given a (finite) set of contexts, assign each lexical item to one (or more) context(s) including a score.

[Diagram: a lexical item linked to the contexts Context_1, Context_2, …, Context_m, with example scores 0.7 and 0.3]

We use a simple, frequency-based, disambiguation algorithm.
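The slides do not spell out the algorithm, so the following is only one plausible reading of "frequency-based": score each context by the share of the item's occurrences observed in it. The context names and counts are invented for illustration.

```python
from collections import Counter

# Hypothetical occurrence counts of lexical items per context.
context_counts = {
    "automotive": Counter({"jaguar": 7, "engine": 12}),
    "zoology": Counter({"jaguar": 3, "habitat": 9}),
    "finance": Counter({"stock": 20}),
}


def context_scores(item, counts_by_context):
    """Score each context by its share of the item's total occurrences."""
    raw = {
        ctx: counts[item]
        for ctx, counts in counts_by_context.items()
        if counts[item] > 0
    }
    total = sum(raw.values())
    if total == 0:
        return {}  # the item was never observed in any context
    return {ctx: n / total for ctx, n in raw.items()}


scores = context_scores("jaguar", context_counts)
for ctx, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{ctx}: {score:.1f}")
```

The scores for one item sum to 1, so an item can be assigned to more than one context with different weights, as in the 0.7 / 0.3 split on the slide.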


Resolution

Car

<http://dbpedia.org/resource/Car>

Each lexical item (a word, or a set of words) is resolved against popular ontologies, including DBpedia, YAGO, GeoNames, WordNet 3 OWL, …
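As a minimal sketch of the resolution step, the code below looks up lexical items in a small local index and prefers the longest matching span of words. The index stands in for a real ontology lookup: only the `Car` URI comes from the slide, the `example.org` entry is a placeholder, and the function name is hypothetical.

```python
# Tiny hand-made index standing in for a real ontology lookup.
ONTOLOGY_INDEX = {
    "car": "http://dbpedia.org/resource/Car",                # URI shown on the slide
    "sports car": "http://example.org/resource/Sports_car",  # placeholder URI
}


def resolve_longest(tokens, index):
    """Greedy longest-match resolution of a token list against the index."""
    results = []
    i = 0
    while i < len(tokens):
        # Try the longest span starting at position i first.
        for j in range(len(tokens), i, -1):
            key = " ".join(tokens[i:j]).lower()
            if key in index:
                results.append((key, index[key]))
                i = j
                break
        else:
            i += 1  # no span starting here is known; skip this token
    return results


print(resolve_longest(["Car"], ONTOLOGY_INDEX))
print(resolve_longest(["Italian", "sports", "car", "manufacturer"], ONTOLOGY_INDEX))
```

Preferring the longest match is a design choice that keeps multi-word items such as "sports car" from being resolved word by word.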


OOLOT

The language of thought is an intermediate format mainly inspired by Kowalski’s LoT.

It has been introduced to represent the extracted knowledge in a way that is totally independent of the original lexical items and, therefore, of the original language.

Our LoT is itself a language, but its lexicon is ontology oriented, so we adopted the acronym OOLOT (Ontology Oriented Language Of Thought).

OOLOT is used to represent the knowledge extracted from natural language sentences, so the building blocks of OOLOT (its lexicon) are ontological identifiers related to concepts (in the ontology); they are not a translation at the lexical level.


OOLOT: Lambda-based translation

Example:

“Many girls eat apples”


And, finally, after applying apple to the previous partial expression, we have:


RDF/OWL Exporting

Since OOLOT is designed to have a representation very close to RDF, it is possible to export to RDF/OWL. In many cases, when it is possible to preserve the semantics, there is a 1:1 mapping; otherwise, when the original semantics cannot be preserved, we are starting to use RDF/OWL syntactic approximations through reification.

Best case:

OOLOT: predicate(subject, object)

RDF: <subject, predicate, object>
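The best-case 1:1 mapping can be sketched as a reordering of the three terms. The concrete OOLOT syntax is not given beyond `predicate(subject, object)`, so the pattern below, the `example.org` namespace, and the function names are assumptions made for illustration.

```python
import re

# Assumed surface form of a binary fact: predicate(subject, object), optional final dot.
FACT_PATTERN = re.compile(r"^\s*(\w+)\(\s*(\w+)\s*,\s*(\w+)\s*\)\s*\.?\s*$")

BASE = "http://example.org/"  # placeholder namespace, not a real ontology


def fact_to_triple(fact):
    """Turn 'predicate(subject, object)' into (subject, predicate, object)."""
    match = FACT_PATTERN.match(fact)
    if match is None:
        raise ValueError(f"not a binary fact: {fact!r}")
    predicate, subject, obj = match.groups()
    return (subject, predicate, obj)


def to_ntriples(triple, base=BASE):
    """Serialize a triple as one N-Triples line, using `base` for every term."""
    return " ".join(f"<{base}{term}>" for term in triple) + " ."


triple = fact_to_triple("eat(girl, apple)")
print(to_ntriples(triple))
```

Facts that are not binary do not fit this direct mapping; those are the cases where the slide mentions approximation through reification.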


Framework In Action

“Ferrari is an Italian sports car manufacturer based in Maranello.”


Conclusion & Future Work

Further exploit:

OOLOT language

ASP to RDF/OWL exporting

This framework is quite new, so many aspects still need to be refined and improved.