SOFIE: A Self-Organizing Framework for Information Extraction

Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum (Max-Planck-Institute for Informatics, Saarbrücken, Germany)


Ontologies

[Figure: ontology graph. Singer and Country are each connected to Entity by subclassOf edges, instances are attached by type edges, and a bornInPlace edge points to USA.]

Wikipedia (e.g. "birth-place: USA") -> DBpedia, YAGO, KYLIN, ...

Internet (e.g. "Elvis died in England") -> ?

Information Extraction

Goal: Extract ontological information from natural language documents

"Elvis died in England" -> diedInPlace(Elvis, England)

Previous approaches: Espresso, DIPRE, LEILA, Snowball, TextRunner, Alice, and many more

- May deliver non-canonical relations: died in, perished in, was killed in, ...
- May deliver non-canonical entities: England, UK, Great Britain, ...
- May deliver inconsistent facts: diedInPlace(Elvis,England), diedInPlace(Elvis,Germany)

Pitfalls of Information Extraction

Web page: "Elvis died in England." "Louis XIV died in France."

Ontology: diedInPlace(Louis XIV, France)

If a pattern occurs with two entities that stand in a relation, then the pattern maps to the relation.

"died in" = diedInPlace

If a meaningful pattern occurs with two entities, then the entities stand in the relation.

diedInPlace("Elvis", "England")

This raises three problems:

- Reasoning Problem: Elvis is a Taxidophobist?
- Disambiguation Problem: "Elvis", "England"
- Pattern Matching Problem: "died in" = diedInPlace ?

Information Extraction as Formulas

Reasoning Problem

type(Elvis,Taxidophobist).

type(X,Taxidophobist) & bornInPlace(X,Y) => diedInPlace(X,Z) [0.8]


Information Extraction as Formulas

Disambiguation Problem

Assumptions:

- In one document, the same word always has the same meaning
- The ontology already knows all important meanings of proper names

possibleMeaning(Elvis@D15, ElvisPresley). [0.7]

- Elvis@D15: a word in context (wic). Here: the word "Elvis" in document D15
- ElvisPresley: one possible meaning of "Elvis" as given by the ontology
- [0.7]: prior estimate of the likelihood of this meaning:

$\frac{|\mathrm{words}(D15) \cap \mathrm{rel}(ElvisPresley)|}{|\mathrm{words}(D15)|}$
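The prior on this slide is a plain overlap ratio, so it can be sketched in a few lines of Python. This is only an illustration: the slide does not define rel(...), so the sketch assumes it is a set of words associated with the entity in the ontology, and the word lists and the second candidate (ElvisCostello) are made-up toy data.

```python
def disambiguation_prior(doc_words, related_words):
    """Return |words(D) ∩ rel(E)| / |words(D)| for one candidate entity E."""
    doc = set(doc_words)
    if not doc:
        return 0.0
    return len(doc & set(related_words)) / len(doc)


# Hypothetical toy data: 10 distinct words in document D15.
words_d15 = ["elvis", "died", "in", "england", "memphis",
             "guitar", "rock", "singer", "born", "tupelo"]

# Assumption: rel(E) is a bag of words related to entity E in the ontology.
rel = {
    "ElvisPresley": {"elvis", "memphis", "tupelo", "guitar",
                     "rock", "singer", "born", "graceland"},
    "ElvisCostello": {"london", "punk", "singer", "attractions"},
}

priors = {entity: disambiguation_prior(words_d15, words)
          for entity, words in rel.items()}
for entity, prior in priors.items():
    print(f"possibleMeaning(Elvis@D15, {entity}). [{prior:.1f}]")
```

With this toy data the candidate that shares more vocabulary with the document gets the higher prior; the prior is only a starting weight, not the final decision.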

Information Extraction as Formulas

Disambiguation Problem

possibleMeaning(Elvis@D15, ElvisPresley). [0.7]

possibleMeaning(X,Y) => means(X,Y)

means(X,Y) & Y≠Z => ¬means(X,Z)


Information Extraction as Formulas

Pattern Matching Problem

Elvis died in England. Louis XIV died in France.

"died in" = diedInPlace ?

occurs("died in", Elvis@D15, England@D15). [14]

occurs(P,Wic1,Wic2) & means(Wic1,X) & means(Wic2,Y) & mapsTo(P,R) => R(X,Y)

occurs(P,Wic1,Wic2) & means(Wic1,X) & means(Wic2,Y) & R(X,Y) => mapsTo(P,R)
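The two pattern rules can be illustrated with a small Python sketch. It applies them as hard rules in a single pass, which is a simplification: in the slides they are weighted formulas whose truth is decided jointly. The function name, the document id D7 and the fixed `means` mapping are assumptions made for the example.

```python
def apply_pattern_rules(occurrences, means, facts):
    """occurrences: (pattern, wic1, wic2); means: wic -> entity;
    facts: set of (relation, x, y) already in the ontology."""
    # Resolve words in context to entities; skip unresolved occurrences.
    resolved = []
    for pattern, wic1, wic2 in occurrences:
        x, y = means.get(wic1), means.get(wic2)
        if x is not None and y is not None:
            resolved.append((pattern, x, y))

    # occurs & means & R(X,Y) => mapsTo(P,R)
    maps_to = {(p, r)
               for p, x, y in resolved
               for r, fx, fy in facts
               if (fx, fy) == (x, y)}

    # occurs & means & mapsTo(P,R) => R(X,Y)
    derived = {(r, x, y)
               for p, x, y in resolved
               for q, r in maps_to
               if q == p} - set(facts)
    return maps_to, derived


occurrences = [
    ("died in", "LouisXIV@D7", "France@D7"),    # D7 is a hypothetical document
    ("died in", "Elvis@D15", "England@D15"),
]
means = {
    "LouisXIV@D7": "LouisXIV", "France@D7": "France",
    "Elvis@D15": "ElvisPresley", "England@D15": "England",
}
facts = {("diedInPlace", "LouisXIV", "France")}

maps_to, derived = apply_pattern_rules(occurrences, means, facts)
print(maps_to)   # pattern-to-relation hypotheses
print(derived)   # candidate new facts
```

Applied blindly, the sketch derives diedInPlace(ElvisPresley, England), which is exactly the kind of hypothesis that the disambiguation and reasoning formulas are meant to be able to reject.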

Information Extraction as Formulas

Reasoning Problem:

type(Elvis,Taxidophobist).

type(X,Taxidophobist) & bornInPlace(X,Y) => diedInPlace(X,Z)

Disambiguation Problem:

possibleMeaning(Elvis@D15, ElvisPresley). [0.7]

Pattern Matching Problem:

occurs("died in", Elvis@D15, England@D15). [14]

Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized:

means(Elvis@D15, ElvisPresley) ?

mapsTo("died in", diedInPlace) ?

diedInPlace(ElvisPresley, England) ?

Weighted MAX SAT Problem

Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized

Problems:

- The Weighted MAX SAT Problem is NP-hard
- Our instance of the problem is huge
- The most popular linear approximation algorithm (Johnson's) does not work well with our type of formulas

Johnson's cannot approximate better than 2/3

bornInPlace(X,Y) => bornInPlace(X,Z)

A v B, A v C, B v C
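To make the objective concrete, here is a brute-force Weighted MAX SAT sketch in Python. It enumerates all truth assignments, so it is exponential and only usable on toy instances, which is the reason an approximation algorithm is needed in the first place. The clause encoding, the variable names M, P, D and all weights are assumptions chosen for illustration.

```python
from itertools import product


def satisfied_weight(clauses, assignment):
    """Sum the weights of clauses with at least one true literal."""
    return sum(weight for weight, literals in clauses
               if any(assignment[var] == polarity for var, polarity in literals))


def brute_force_max_sat(clauses):
    """Try every assignment; return (best weight, best assignment)."""
    variables = sorted({var for _, literals in clauses for var, _ in literals})
    best = None
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        score = satisfied_weight(clauses, assignment)
        if best is None or score > best[0]:
            best = (score, assignment)
    return best


# Hypothetical toy instance. A literal is (variable, polarity).
# M = means(Elvis@D15, ElvisPresley), P = mapsTo("died in", diedInPlace),
# D = diedInPlace(ElvisPresley, England)
clauses = [
    (0.7, [("M", True)]),                               # disambiguation prior
    (14.0, [("M", False), ("P", False), ("D", True)]),  # M & P => D
    (5.0, [("D", False)]),                              # assumed evidence against D
    (2.0, [("P", True)]),                               # assumed evidence for the pattern
]

best_weight, best_assignment = brute_force_max_sat(clauses)
print(best_weight, best_assignment)
```

In this toy instance the best assignment keeps the pattern mapping, rejects the fact, and therefore also rejects the reading of "Elvis" as ElvisPresley, giving up only the small prior weight.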

FMS Algorithm

Formulas:

A v B [w1]

A v B [w2]

B v C [w3]

C [w4]

Hypotheses: A = true, B = false, C = false

The Functional MAX SAT Algorithm considers only unit clauses.

The Functional MAX SAT Algorithm propagates Dominating Unit Clauses:

¬A v B [10]

¬A [10]

A [30]

A = true, because 30 > 10+10
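The dominating-unit-clause test can be sketched as follows. The sketch reads the example as ¬A v B [10], ¬A [10], A [30] and assumes that a unit clause dominates when its weight exceeds the total weight of all clauses containing the opposite literal; this reading of the slide, the function name and the clause encoding are assumptions, and the sketch covers only this one test, not the full FMS algorithm.

```python
def dominating_unit_literal(clauses):
    """Return (variable, value) for a dominating unit clause, or None.

    clauses: list of (weight, [(variable, polarity), ...]).
    """
    unit_weight = {}     # literal -> total weight of unit clauses on it
    literal_weight = {}  # literal -> total weight of clauses containing it
    for weight, literals in clauses:
        for literal in set(literals):
            literal_weight[literal] = literal_weight.get(literal, 0) + weight
        if len(literals) == 1:
            unit_weight[literals[0]] = unit_weight.get(literals[0], 0) + weight

    for (var, polarity), weight in unit_weight.items():
        opposing = literal_weight.get((var, not polarity), 0)
        if weight > opposing:
            return var, polarity
    return None


example = [
    (10, [("A", False), ("B", True)]),  # ¬A v B [10]
    (10, [("A", False)]),               # ¬A [10]
    (30, [("A", True)]),                # A [30]
]
print(dominating_unit_literal(example))  # A is fixed to true since 30 > 10 + 10
```

Fixing A = true is safe here because even if both opposing clauses end up unsatisfied, the loss (20) is smaller than the weight gained by satisfying the unit clause (30).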

FMS Algorithm

- Approximation guarantee
- Polynomial time
- Experiments show better performance in practice than Johnson's algorithm in our setting.

[Figure: the FMS Algorithm, drawn as a program ("FOR i=1 TO 42 ... NEXT i"), receives the text "Elvis died in England" and rules such as r(X,Y) & s(Y) => t(X,Y), and outputs truth values for the hypotheses:]

type(Elvis,Taxidophobist) = 1

diedIn(Elvis,England) = 0

means(Elvis@D15,Elvis) = 0

means(Elvis@D15,...) = 1

[Figure: resulting ontology edge: St. Elvis, diedIn, England]

Other Experiments

Corpus                    Type             # Docs  Relations  Time   Precision
Wikipedia toy corpus      structured       100     3          2min   100%
Wikipedia subcorpus       semi-structured  2000    15         15h    94%
News article toy corpus   unstructured     150     1          24min  91%
Biographies from Web      unstructured     3440    5          15h    90%

Conclusion

SOFIE unifies the tasks of

- entity disambiguation
- pattern extraction
- semantic constraint reasoning

in a single framework, delivering

- canonicalized facts
- of high precision (experiments show 90% precision)

died in England... but is alive!

SOFIE rules!

occurs(P,WX,WY) /\ refersTo(WX,X) /\ refersTo(WY,Y) /\ R(X,Y) => expresses(P,R)

occurs(P,WX,WY) /\ expresses(P,R) /\ refersTo(WX,X) /\ refersTo(WY,Y) /\ range(R,D1) /\ domain(R,D2) /\ type(X,D1) /\ type(Y,D2) => R(X,Y)

R(X,Y) /\ R(X,Z) /\ type(R,function) => Y = Z

disambiguationPrior(W,X) => refersTo(W,X)

bornInYear(X,B) /\ diedInYear(X,D) => B<D
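The last three rules act as consistency constraints, and a small Python sketch can show what they rule out. It checks a set of facts against the functional constraint and the born-before-died constraint as hard checks, whereas in the framework they are weighted formulas inside the optimization. The function name, the fact encoding and the person "SomePerson" are assumptions for the example.

```python
def find_violations(facts, functional_relations):
    """facts: set of (relation, subject, object). Returns a list of violations."""
    violations = []

    # R(X,Y) /\ R(X,Z) /\ type(R,function) => Y = Z
    objects = {}
    for relation, subject, obj in facts:
        if relation in functional_relations:
            objects.setdefault((relation, subject), set()).add(obj)
    for (relation, subject) in sorted(objects):
        values = objects[(relation, subject)]
        if len(values) > 1:
            violations.append(("functional", relation, subject, sorted(values)))

    # bornInYear(X,B) /\ diedInYear(X,D) => B < D
    born = {s: o for r, s, o in facts if r == "bornInYear"}
    died = {s: o for r, s, o in facts if r == "diedInYear"}
    for subject in sorted(born.keys() & died.keys()):
        if not born[subject] < died[subject]:
            violations.append(("born-before-died", subject,
                               born[subject], died[subject]))
    return violations


facts = {
    ("diedInPlace", "Elvis", "England"),
    ("diedInPlace", "Elvis", "Germany"),
    ("bornInYear", "Elvis", 1935),
    ("diedInYear", "Elvis", 1977),
    ("bornInYear", "SomePerson", 1900),   # hypothetical, deliberately inconsistent
    ("diedInYear", "SomePerson", 1850),
}
for violation in find_violations(facts, functional_relations={"diedInPlace"}):
    print(violation)
```

The two conflicting diedInPlace facts are the inconsistency from the earlier Information Extraction slide; a functional constraint means at most one of them can be accepted.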

SOFIE: Experiments

Corpus                    Type                                     # Docs  Relations  Time   Precision  Recall
Wikipedia toy corpus      structured                               100     3          8min   100%       80%
Wikipedia toy corpus      semi-structured (50% infoboxes removed)  100     3          8min   100%       57%
Wikipedia subcorpus       semi-structured                          2000    15         15h    94%        ?
News article toy corpus   unstructured                             150     1          24min  91%        24%, 31%
  Snowball                                                                                   56%        31%
Biographies from Web      unstructured                             3440    5          15h    90%        ?


SOFIE: Large-Scale Experiment

Goal:

Extract bornIn, bornOnDate, diedIn, diedOnDate, politicianOf

Corpus:

3700 biography documents downloaded from the Web

Runtime: (summed over 5 batches)

Parsing 7:05h

Hypothesis Generation 6:15h

Solving 2:30h

Total 15:50h

Results: (precision in %)

bornIn bornOnD diedIn diedOnD polOf

87 87 13 98 95 90

SOFIE: Relation to Markov Logic

[Figure: probability P for the two truth values (false, true) of bornIn(Nicholas, Patras)]

r(x,y) /\ s(x,z) => t(x,z) [w]

...

$P(X) \sim e^{\sum_i sat(i,X)\, w_i}$

where sat(i,X) is the number of satisfied instances of the ith formula and w_i is the weight of the ith formula.

$\max_X e^{\sum_i sat(i,X)\, w_i}$

$\max_X \log\left( e^{\sum_i sat(i,X)\, w_i} \right)$

$\max_X \sum_i sat(i,X)\, w_i$

~~~~> Weighted MAX SAT problem
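The step from the Markov Logic distribution to Weighted MAX SAT rests on the exponential being monotone, which can be checked numerically. The sketch assumes the formulas on this slide read P(X) ~ exp(sum_i sat(i,X) w_i), uses ground clauses so that sat(i,X) is 0 or 1, and invents three weighted clauses purely for illustration.

```python
import math
from itertools import product

# Hypothetical ground clauses: (weight, [(variable, polarity), ...]).
clauses = [
    (1.5, [("A", True), ("B", True)]),
    (0.8, [("A", False)]),
    (2.0, [("B", False), ("C", True)]),
]
variables = ["A", "B", "C"]


def weight_sum(world):
    """sum_i sat(i, X) * w_i for ground clauses, where sat is 0 or 1."""
    return sum(w for w, literals in clauses
               if any(world[var] == polarity for var, polarity in literals))


worlds = [dict(zip(variables, values))
          for values in product([False, True], repeat=len(variables))]
scores = [weight_sum(world) for world in worlds]

# Normalized Markov-Logic-style probabilities: P(X) = exp(score) / Z.
z = sum(math.exp(score) for score in scores)
probabilities = [math.exp(score) / z for score in scores]

best_by_probability = max(range(len(worlds)), key=probabilities.__getitem__)
best_by_weight = max(range(len(worlds)), key=scores.__getitem__)
print(worlds[best_by_probability], worlds[best_by_weight])
```

Both selections pick the same world, so finding the most probable world never requires computing the normalization constant Z.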

Grounding

Rule: r(X,Y) & s(Y) => t(X,Y)

Clause form: { ¬r(X,Y), ¬s(Y), t(X,Y) }

Grounded over Entities = {a,b}:

{ ¬r(a,a), ¬s(a), t(a,a) }

{ ¬r(a,b), ¬s(b), t(a,b) }

{ ¬r(b,a), ¬s(a), t(b,a) }

{ ¬r(b,b), ¬s(b), t(b,b) }

The atoms r(a,a), r(a,b), r(b,a), r(b,b) are immutable, complete facts (e.g. pattern occurrences). If only r(a,a) [w] is given, the grounding reduces to:

{ ¬s(a), t(a,a) } [w]

Grounding all rules yields weighted clauses:

{ ¬s(a), t(a,a) } [w1]

{p(c,d), q(e), } [w2]

Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized

means(Elvis@D15, ElvisPresley) = true ?

mapsTo("died in", diedInPlace) = true ?

diedInPlace(ElvisPresley, England) = true ?
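The grounding step for the example rule r(X,Y) & s(Y) => t(X,Y) can be sketched in Python. The sketch assumes that r is an immutable, complete predicate, so a ground instance whose r atom is not a known fact is trivially satisfied and can be dropped, while a known r atom is removed from the clause. The function name, the clause encoding and the weight handling are assumptions made for illustration.

```python
from itertools import product


def ground_rule(entities, known_r, weight):
    """Ground r(X,Y) & s(Y) => t(X,Y), i.e. the clause {¬r(X,Y), ¬s(Y), t(X,Y)}.

    known_r: set of (x, y) pairs for which r holds (assumed complete).
    Returns weighted clauses over the remaining hypotheses s and t.
    """
    clauses = []
    for x, y in product(entities, repeat=2):
        if (x, y) not in known_r:
            continue  # ¬r(x,y) is true, so this instance is always satisfied
        # r(x,y) is true, so ¬r(x,y) can never satisfy the clause: drop it.
        clauses.append((weight, [(f"s({y})", False), (f"t({x},{y})", True)]))
    return clauses


entities = ["a", "b"]
all_pairs = set(product(entities, repeat=2))

full = ground_rule(entities, known_r=all_pairs, weight=1.0)
reduced = ground_rule(entities, known_r={("a", "a")}, weight=1.0)
print(len(full))   # one clause per substitution of X and Y
print(reduced)     # only the instance for r(a,a) survives
```

Pruning against the known facts is what keeps the instance manageable: the number of ground clauses depends on the facts that actually occur rather than on every possible pair of entities.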