SOFIE - A Unified Approach to Ontology-Based Information Extraction Using Reasoning

Copyright 2010 Digital Enterprise Research Institute. All rights reserved. Digital Enterprise Research Institute, www.deri.ie. SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Reasoning. Tobias Wunner, Unit for Natural Language Processing (UNLP), [email protected]. Wednesday, 22nd June 2011. DERI, Reading Group.

Upload: tobias-wunner

Post on 18-Dec-2014


DESCRIPTION

The creation of new knowledge in the Semantic Web increasingly depends on automatic knowledge enrichment processes such as semi-structured Information Extraction (IE), for example the creation of DBpedia from Wikipedia. To further improve knowledge coverage, IE must also consider unstructured, plain natural language text. Here SOFIE offers a novel approach to IE that can consistently enrich semantic models from text sources by combining pattern matching, entity disambiguation and reasoning in a propositional logic approach using MAX SAT.

TRANSCRIPT

Page 1

Copyright 2010 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

SOFIE - A Unified Approach To Ontology-Based Information Extraction Using Reasoning

Tobias Wunner

Unit for Natural Language Processing (UNLP)

[email protected]

Wednesday, 22nd June 2011

DERI, Reading Group

Page 2

Based On:

“SOFIE: A Self-Organizing Framework for Information Extraction”

Authors: Fabian Suchanek, Mauro Sozio, Gerhard Weikum

Published: World Wide Web Conference (WWW), Madrid, 2009

Page 3

Overview

1. Introduction

2. SOFIE Model + Rules

3. Excursion: Satisfiability

4. SOFIE Approach

5. Evaluation experiments

6. Conclusion


Page 4

Motivation

Classical IE on text: pattern-based, 80%

Semi-structured approach: Wikipedia infoboxes, 95%

Idea of the paper: combine both

use text (hypotheses) + ontology (trusted facts)

Page 5

Example

Document1

Einstein attended secondary school in Germany.

YAGO ontology

familyName(AlbertEinstein, Einstein)

bornIn(AlbertEinstein, Germany)

New Knowledge

attendedSchoolIn(AlbertEinstein, Germany)

Page 6

General Idea

Express extraction patterns as facts

Rules to understand usage of terms

Add restrictions

patternOcc(“X went to school in Y”, Einstein, Switzerland)

patternOcc(Pattern,X,Y) and R(X,Y) ⇒ express(Pattern,R)
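A minimal Python sketch of how the rule patternOcc(Pattern,X,Y) and R(X,Y) ⇒ express(Pattern,R) could be applied: whenever a pattern occurrence and a known fact share the same argument pair, the pattern is hypothesised to express that relation. The tuple layout, the function name `derive_express` and the sample fact `attendedSchoolIn(Einstein, Switzerland)` are assumptions made for illustration, not SOFIE's actual data model.

```python
def derive_express(pattern_occurrences, facts):
    """Apply patternOcc(P, X, Y) and R(X, Y) => express(P, R)."""
    # Index the known facts by their argument pair for quick lookup.
    relations_by_pair = {}
    for relation, x, y in facts:
        relations_by_pair.setdefault((x, y), set()).add(relation)

    hypotheses = set()
    for pattern, x, y in pattern_occurrences:
        for relation in relations_by_pair.get((x, y), ()):
            hypotheses.add((pattern, relation))
    return hypotheses


# Illustrative data only; the facts are assumed for this example.
occurrences = [("X went to school in Y", "Einstein", "Switzerland")]
facts = [
    ("attendedSchoolIn", "Einstein", "Switzerland"),
    ("bornIn", "Einstein", "Ulm"),
]
express = derive_express(occurrences, facts)
print(express)
```

In SOFIE the resulting `express` statements are hypotheses to be weighed against other evidence, not conclusions accepted outright.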

Page 7

Contribution

Unified approach to

– Pattern matching

– Word Sense Disambiguation

– Reasoning

Large scale

On unstructured data

Page 8

Pattern extraction with WICs

WICs (Words in Context)

Extract patterns based on ‘interesting’ entities

Documents

Einstein was born at Ulm in Württemberg, Germany, on March 18, 1879. When Albert was around four, his father gave him a magnetic compass.

When Albert became older, he went to a school in Switzerland. After he graduated, he got a job in the patent office there…

Knowledge Base

patternOcc(“Einstein was born in Ulm”, Einstein@D1, Ulm@D1) [1]

patternOcc(“Ulm is in Württemberg, Germany”, Ulm@D1, Germany@D1) [1]

patternOcc(“Albert .. Switzerland”, Albert@D1, Switzerland@D1) [1]
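The extraction step on this slide can be approximated with a few lines of Python. This is a rough sketch under simplifying assumptions: entity mentions are found by exact name matching, a pattern is the text between two adjacent mentions in one sentence, and a word in context is written `name@document`. The function name `extract_pattern_occurrences` is hypothetical, and SOFIE's real tokenisation and pattern handling are more involved.

```python
import re


def extract_pattern_occurrences(text, doc_id, entity_names):
    """Return (pattern, wic_1, wic_2) triples for adjacent entity mentions."""
    occurrences = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Locate every mention of an 'interesting' entity name.
        mentions = []
        for name in entity_names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", sentence):
                mentions.append((m.start(), m.end(), name))
        mentions.sort()
        # The text between two adjacent mentions becomes the pattern.
        for (_, end1, n1), (start2, _, n2) in zip(mentions, mentions[1:]):
            between = sentence[end1:start2].strip()
            occurrences.append(
                (f"X {between} Y", f"{n1}@{doc_id}", f"{n2}@{doc_id}")
            )
    return occurrences


text = "When Albert became older, he went to a school in Switzerland."
occ = extract_pattern_occurrences(text, "D1", ["Albert", "Switzerland"])
print(occ)
```

Keeping the document identifier in each word in context leaves the decision about which entity "Albert" refers to for the later disambiguation step.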

Page 9

Grounding

How to test rules?

find an instance which satisfies the formulae

Rules with variables

bornIn(X,Ulm) ⇒ ¬bornIn(X,Timbuktu)

studiedIn(X,Ulm)

Grounded instances

bornIn(Einstein,Ulm) ⇒ ¬bornIn(Einstein,Timbuktu)

studiedIn(Einstein,Ulm)
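Grounding as shown here, replacing each variable with concrete entities, can be sketched as follows. The representation of a rule as a tuple of `(predicate, arguments)` atoms and the helper name `ground` are assumptions for illustration only.

```python
from itertools import product


def ground(template, variables, constants):
    """Yield every instance of `template` with variables bound to constants."""
    for values in product(constants, repeat=len(variables)):
        binding = dict(zip(variables, values))
        # Arguments that are not variables (e.g. Ulm) are left untouched.
        yield tuple(
            (pred, tuple(binding.get(arg, arg) for arg in args))
            for pred, args in template
        )


# bornIn(X, Ulm) => not bornIn(X, Timbuktu), stored as premise and conclusion.
rule = (("bornIn", ("X", "Ulm")), ("not bornIn", ("X", "Timbuktu")))
instances = list(ground(rule, ["X"], ["Einstein"]))
print(instances)
```

With more constants the number of instances grows as the number of constants to the power of the number of variables, which is one reason the later reasoning step has to cope with a large number of clauses.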

Page 10

Rules (Hypotheses)

Disambiguation

– disambiguatesAs(Albert@D,AlbertEinstein)[?]

Expresses a new fact

– expresses(P, livedIn(Einstein,Switzerland))[?]

New facts

– CityIn(Ulm,Germany)[?]

Page 11

New fact rule

...with disambiguation

patternOcc( P, WX, WY ) and

disambiguatesAs(WX, X) and

disambiguatesAs(WY, Y) and

R(X,Y)

⇒ express( P, R )

“Pattern P expresses relation R when the WICs under analysis are disambiguated”
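For the MAX SAT treatment later in the deck it helps to see the rule as a single clause: an implication (p1 and ... and pn) ⇒ c is logically equivalent to (¬p1 or ... or ¬pn or c). The sketch below shows that rewriting for one grounded instance of the rule. The literal encoding `(atom, polarity)` and the sample atoms are assumptions chosen for illustration.

```python
def implication_to_clause(premises, conclusion):
    """Rewrite (p1 and ... and pn) => c as (not p1 or ... or not pn or c)."""
    # A literal is (atom, polarity); polarity False means the atom is negated.
    return frozenset([(p, False) for p in premises] + [(conclusion, True)])


def clause_holds(clause, truth):
    """A clause holds if at least one literal agrees with the assignment."""
    return any(truth[atom] == polarity for atom, polarity in clause)


# One grounded instance of the new fact rule (sample atoms, assumed).
premises = [
    "patternOcc(P, Albert@D1, Switzerland@D1)",
    "disambiguatesAs(Albert@D1, AlbertEinstein)",
    "disambiguatesAs(Switzerland@D1, Switzerland)",
    "livedIn(AlbertEinstein, Switzerland)",
]
conclusion = "express(P, livedIn)"
clause = implication_to_clause(premises, conclusion)

# The clause is violated only when all premises hold but the conclusion does not.
truth = {atom: True for atom in premises}
truth[conclusion] = False
print(clause_holds(clause, truth))
```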

Page 12

Restrictions

Disambiguation: the disambiguation prior should influence the choice of disambiguation

disambPrior( W, X, N )

⇒ disambiguatesAs( W, X )

N - any disambiguation function, e.g.

N = | words(D1) ∩ rel(AlbertEinstein) | / | words(D1) |
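The overlap ratio | words(D1) ∩ rel(AlbertEinstein) | / | words(D1) | can be computed directly. This sketch assumes that both sides are treated as sets of lower-cased words and that `rel(entity)` is available as a plain set; the related-word sets below are invented for the example.

```python
def disamb_prior(document_words, related_words):
    """Share of distinct document words that are related to the entity."""
    words = set(document_words)
    if not words:
        return 0.0
    return len(words & set(related_words)) / len(words)


doc_words = "albert went to a school in switzerland".split()
# Hypothetical rel(...) sets, not taken from YAGO.
rel = {
    "AlbertEinstein": {"physics", "switzerland", "ulm", "school"},
    "HermannEinstein": {"ulm", "munich"},
}
priors = {entity: disamb_prior(doc_words, words) for entity, words in rel.items()}
print(priors)
```

The candidate with the higher prior gets the stronger disambPrior statement, so the reasoner prefers it unless other evidence outweighs it.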

Page 13

Restrictions

Functional restrictions


R(X,Y) and

type(R, function) and

different(Y,Z)

⇒ ¬R(X,Z)

Albert@D1 ≠ Albert@D2

“Albert@D1 born in?”
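A small Python sketch of the functional restriction: for a relation typed as a function, two facts with the same first argument and different second arguments conflict. The function name `functional_violations` and the sample facts are assumptions for illustration.

```python
def functional_violations(facts, functional_relations):
    """Return pairs R(X, Y), R(X, Z) with Y != Z for functional R."""
    first_value = {}
    violations = []
    for relation, x, y in facts:
        if relation not in functional_relations:
            continue
        key = (relation, x)
        if key in first_value and first_value[key] != y:
            violations.append(
                ((relation, x, first_value[key]), (relation, x, y))
            )
        # Remember the first value seen for this (relation, subject).
        first_value.setdefault(key, y)
    return violations


facts = [
    ("bornIn", "AlbertEinstein", "Ulm"),
    ("bornIn", "AlbertEinstein", "Timbuktu"),
    ("livedIn", "AlbertEinstein", "Ulm"),
    ("livedIn", "AlbertEinstein", "Switzerland"),
]
conflicts = functional_violations(facts, {"bornIn"})
print(conflicts)
```

Only `bornIn` is declared functional here, so the two `livedIn` facts coexist without conflict.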

Page 14

SOFIE Rules

Framework to test the hypotheses

Question: “How to satisfy all of them?”

rules + trusted facts

patternOcc( P, X, Y ) and

R(X,Y)

⇒ express( P, R )

disambPrior(Albert@D1, HermannEinstein, 3)

⇒ disambiguatesAs(Albert@D1, HermannEinstein)

disambPrior(Albert@D1, AlbertEinstein, 10)

⇒ disambiguatesAs(Albert@D1, AlbertEinstein)

Country(Germany)

livedIn(AlbertEinstein,Ulm)

Page 15

SAT / MAX SAT

SAT (Satisfiability): prove that a formula can be TRUE

Complexity classes

P: good, example N^k

NP: bad, c^N

– e.g. naive algorithm for 100 variables

2^100 × 10^-10 s per row = 4 × 10^12 y

– Not always: 3SAT in (4/3)^N

– SAT Solver

F = (X or Y or Z) and (¬X or Y or Z) and (¬X or ¬Y or ¬Z)

G = (X or Y) and (¬X or ¬Y) and (X)

X Y Z F
0 0 0 0
0 0 1 1
0 1 0 1
0 1 1 1
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 0

truth table has 2^3 rows

Details: Schöning 2010
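The naive truth-table approach can be written out in a few lines of Python, using the formula F = (X or Y or Z) and (¬X or Y or Z) and (¬X or ¬Y or ¬Z) from the slide. The clause encoding as lists of `(variable, polarity)` pairs is an assumption for illustration. The last lines redo the cost estimate for 100 variables, assuming 10^-10 seconds per row, which is the reading under which a figure of roughly 4 × 10^12 years comes out.

```python
from itertools import product

# A clause is a list of (variable, polarity) literals.
F = [
    [("X", True), ("Y", True), ("Z", True)],
    [("X", False), ("Y", True), ("Z", True)],
    [("X", False), ("Y", False), ("Z", False)],
]


def evaluate(cnf, assignment):
    """A CNF formula is true if every clause has at least one true literal."""
    return all(
        any(assignment[var] == pol for var, pol in clause) for clause in cnf
    )


def truth_table(cnf, variables):
    """Enumerate all 2^n assignments together with the formula's value."""
    rows = []
    for values in product([False, True], repeat=len(variables)):
        rows.append((values, evaluate(cnf, dict(zip(variables, values)))))
    return rows


table = truth_table(F, ["X", "Y", "Z"])
satisfiable = any(value for _, value in table)
print(len(table), satisfiable)

# Enumerating 2^100 rows at an assumed 1e-10 seconds per row.
years = 2**100 * 1e-10 / (365.25 * 24 * 3600)
print(f"{years:.1e} years")
```

The enumeration doubles with every added variable, which is why practical SAT solvers avoid building the table at all.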

Page 16

SAT / MAX SAT

MAX SAT: satisfy as many clauses as possible

G = (X or Y) and (¬X or ¬Y) and (X)

X Y G #clauses
0 0 0 1
0 1 0 2
1 0 1 3
1 1 0 2

(#clauses = number of satisfied clauses)

Details: Schöning 2010
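MAX SAT asks for the assignment that satisfies the most clauses, even when not all of them can hold at once. A brute-force Python sketch for G = (X or Y) and (¬X or ¬Y) and (X) follows; the clause encoding and function names are assumptions for illustration, and exhaustive search is only feasible for toy inputs.

```python
from itertools import product

G = [
    [("X", True), ("Y", True)],
    [("X", False), ("Y", False)],
    [("X", True)],
]


def satisfied_count(cnf, assignment):
    """Number of clauses with at least one true literal."""
    return sum(
        any(assignment[var] == pol for var, pol in clause) for clause in cnf
    )


def max_sat(cnf, variables):
    """Try every assignment and keep the one satisfying the most clauses."""
    best_assignment, best_count = None, -1
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        count = satisfied_count(cnf, assignment)
        if count > best_count:
            best_assignment, best_count = assignment, count
    return best_assignment, best_count


best_assignment, best_count = max_sat(G, ["X", "Y"])
print(best_assignment, best_count)
```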

Page 17

Weighted MAX SAT in SOFIE

...back to SOFIE

this is MAX SAT but with weights

rules + trusted facts

patternOcc( P, X, Y ) and

R(X,Y)

⇒ express( P, R )

disambPrior(Albert@D1, HermannEinstein, 3)

⇒ disambiguatesAs(Albert@D1, HermannEinstein)

disambPrior(Albert@D1, AlbertEinstein, 10)

⇒ disambiguatesAs(Albert@D1, AlbertEinstein)

Country(Germany)

livedIn(AlbertEinstein,Ulm)
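In the weighted variant each clause carries a weight and the goal is to maximise the total weight of satisfied clauses. The sketch below uses the two disambiguation priors from the slide (3 and 10) as clause weights and adds one extra clause, with an assumed weight of 100, stating that Albert@D1 cannot refer to both people. That extra clause and its weight are illustrative assumptions, not taken from the slides.

```python
from itertools import product

H = "disambiguatesAs(Albert@D1, HermannEinstein)"
A = "disambiguatesAs(Albert@D1, AlbertEinstein)"

weighted_clauses = [
    (3, [(H, True)]),                   # prior 3 from the slide
    (10, [(A, True)]),                  # prior 10 from the slide
    (100, [(H, False), (A, False)]),    # assumed: only one meaning per word
]


def satisfied_weight(clauses, assignment):
    """Total weight of the clauses that have a true literal."""
    return sum(
        weight
        for weight, literals in clauses
        if any(assignment[var] == pol for var, pol in literals)
    )


def weighted_max_sat(clauses):
    """Exhaustive search; fine for a toy example, hopeless at scale."""
    variables = sorted({var for _, lits in clauses for var, _ in lits})
    best_assignment, best_weight = None, -1
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = satisfied_weight(clauses, assignment)
        if weight > best_weight:
            best_assignment, best_weight = assignment, weight
    return best_assignment, best_weight


best, weight = weighted_max_sat(weighted_clauses)
print(best, weight)
```

The higher prior wins: the best assignment accepts the reading as AlbertEinstein and rejects HermannEinstein.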

Page 18

Weighted MAX SAT in SOFIE

Weighted MAX SAT is NP-hard: only approximation algorithms are practical

impractical to find the optimal solution

SAT Solver: Johnson’s algorithm, approximation guarantee 2/3
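A sketch of the greedy scheme usually described as Johnson's algorithm, written from the textbook description and not from the SOFIE implementation. Each clause starts with a modified weight of w · 2^(-|clause|); variables are fixed one at a time to the polarity with the larger modified weight; satisfied clauses are dropped, and a clause that loses a literal has its modified weight doubled. The data layout and function names are assumptions.

```python
def johnson_max_sat(weighted_clauses, variables):
    """Greedy approximation for weighted MAX SAT."""
    # Live clauses: [remaining literals, modified weight w * 2^-|clause|].
    live = [[set(lits), w * 2.0 ** -len(lits)] for w, lits in weighted_clauses]
    assignment = {}
    for var in variables:
        pos = sum(mw for lits, mw in live if (var, True) in lits)
        neg = sum(mw for lits, mw in live if (var, False) in lits)
        value = pos >= neg
        assignment[var] = value
        still_live = []
        for lits, mw in live:
            if (var, value) in lits:
                continue  # clause is satisfied, drop it
            if (var, not value) in lits:
                lits = lits - {(var, not value)}
                mw *= 2  # one literal fewer left to satisfy the clause
                if not lits:
                    continue  # clause is falsified, drop it
            still_live.append([lits, mw])
        live = still_live
    return assignment


def satisfied_weight(weighted_clauses, assignment):
    return sum(
        w for w, lits in weighted_clauses
        if any(assignment[var] == pol for var, pol in lits)
    )


G = [
    (1, [("X", True), ("Y", True)]),
    (1, [("X", False), ("Y", False)]),
    (1, [("X", True)]),
]
result = johnson_max_sat(G, ["X", "Y"])
print(result, satisfied_weight(G, result))
```

On this small formula the greedy choice happens to reach the optimum; in general only the approximation guarantee holds.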

Page 19

Weighted MAX SAT in SOFIE

Functional MAX SAT

Specialized reasoning (support for functional properties)

Approximation guarantee 1/2

A v B [w1]

A v B [w2]

B v C [w3]

C [w4]

Considers only unit clauses

Propagates dominating unit clauses

A v B [10]

A [10]

A [30]

A = true

30 > 10+10
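The idea of propagating a dominating unit clause can be sketched as follows. Negation signs are not visible in the transcript, so this example assumes the clauses were ¬A v B [10], ¬A [10] and A [30]: the unit clause A then dominates because its weight 30 exceeds the combined weight 10 + 10 of the clauses that contain the opposite literal. The polarities, the dominance test as written and the function names are assumptions, not a transcription of the Functional MAX SAT algorithm.

```python
def find_dominating_unit(clauses):
    """Return a unit literal whose weight exceeds that of all opposing clauses."""
    unit_weight = {}
    for weight, literals in clauses:
        if len(literals) == 1:
            literal = next(iter(literals))
            unit_weight[literal] = unit_weight.get(literal, 0) + weight
    for (var, pol), weight in unit_weight.items():
        opposing = sum(w for w, lits in clauses if (var, not pol) in lits)
        if weight > opposing:
            return (var, pol)
    return None


def propagate(clauses, literal):
    """Make `literal` true: drop satisfied clauses, shrink the others."""
    var, pol = literal
    remaining = []
    for weight, literals in clauses:
        if literal in literals:
            continue  # satisfied
        reduced = literals - {(var, not pol)}
        if reduced:
            remaining.append((weight, reduced))
        # an empty clause is falsified and simply dropped
    return remaining


# Assumed polarities: (not A or B) [10], (not A) [10], (A) [30].
clauses = [
    (10, frozenset({("A", False), ("B", True)})),
    (10, frozenset({("A", False)})),
    (30, frozenset({("A", True)})),
]
dominating = find_dominating_unit(clauses)
rest = propagate(clauses, dominating)
print(dominating, rest)
```

After setting A to true, the first clause shrinks to the unit clause B [10], which can be considered in the next round.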

Page 20

Controlled experiment

Corpus from Wikipedia infoboxes: 100 articles

Semantics are known!

Page 21

Controlled experiment

Large-scale: corpus of 2000 Wikipedia articles

13 frequent relations from YAGO

Parsing = 87 min, Reasoning = 77 min

Page 22

Unstructured text sources

150 newspaper articles

relation under test: headquarterOf

YAGO (modified with relation seeds)

Parsing 87 min, Weighted MAX SAT 77 min

disambiguated entries (provenance) could be manually assessed

functional relation

Page 23

Unstructured text sources

Large-scale: 10 biographies for each of 400 US senators

5 relationships

Disambiguation was not ideal for YAGO (13 entities named James Watson)

Parsing 7 h, Weighted MAX SAT 9 h

Results

– 4 good

– 1 bad (misleading patterns)

Page 24

Reformulate OWL in propositional logic: OWL → FOL → Skolem Normal Form → Propositional Logic

Might find OWL-inconsistent ontologies due to the Open World Assumption

MAX SAT can’t do OWL per se (Open World Assumption)

define a student as a subclass of “attends some course”

⇒ ∀ x, ∃ y: attends(x,y) and Course(y) → Student(x)

⇒ ∀ x: attends(x,k) and Course(k) → Student(x); for some constant k

⇒ ¬attends(xi, k) or ¬Course(k) or Student(xi); k = x1 .. xn

Inferred Ontology

{ Student(alex), Student(bob), Student subClassOf attends some Course, attends(alex, SemanticWeb) }

Details: JMC 2010

Page 25

Conclusions

Ontology-based IE (OBIE) reformulated as a weighted MAX SAT problem

Approximation algorithm with guarantee 1/2

Works and scales (large corpus + YAGO)

Page 26

Limitations

Specialized approximation algorithm

– Accounts for SOFIE rules, NOT OWL

MAX SAT restrictions

– ∈ Propositional Logic

– ∉ First-Order Logic

Ontology population approach (can’t infer new relations)

Page 27

References

1. F. Suchanek et al., SOFIE: A Self-Organizing Framework for Information Extraction, Proceedings of the 18th International Conference on World Wide Web (WWW '09)

2. John McCrae, Automatic Extraction of Logically Consistent Ontologies from Text, PhD thesis, National Institute of Informatics, Japan, 2009

3. Uwe Schöning, Das SAT-Problem, Informatik Spektrum 33(5): 479-483, 2010

4. F. Suchanek, Automated Construction and Growth of a Large Ontology, PhD thesis, Saarland University, Saarbrücken, Germany, 2009