Page 1: Towards Scalable Information Integration with Instance Coreferences

Towards Scalable Information Integration with Instance Coreferences

Abir Qasem (1), Dimitre Dimitrov (2), Jeff Heflin (1)

(1) Lehigh University, (2) Tech-X Corporation

07/11/09

U.S. Department of Energy DE-FG02-05ER84171 SBIR grant

Page 2: Towards Scalable Information Integration with Instance Coreferences

The Semantic Web

• Definition
  – The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. (Berners-Lee et al., Scientific American, May 2001)
• Ontology
  – a key component of the Semantic Web
  – ontologies define the semantics of the terms used in semi-structured web pages
    • identify context, provide shared definitions
    • has a formal syntax and unambiguous semantics
  – can be used to describe alignments between heterogeneous schemas

Page 3: Towards Scalable Information Integration with Instance Coreferences

A Web of Ontologies

[Figure: a graph of ontologies and data sources. Ontologies: Foaf, DBLP, Congress, Citeseer, AIGP, NSF Awards, Region, Dublin Core. Data sources: S1 to S7. Edges labeled "extends" connect ontologies to one another; edges labeled "commits to" connect data sources to ontologies.]

The answer to a user’s query might require the combination of data from S1, S2, S3, and S4.

Page 4: Towards Scalable Information Integration with Instance Coreferences

Semantic Web Standards

World Wide Web Consortium (W3C) Recommendations

• RDF(S) (1999, revised 2004)
  – essentially semantic networks with URIs
  – XML serialization syntax
• OWL (2004)
  – extends RDF with more semantic primitives
  – based on description logics (DLs)
  – has a model theoretic semantics

[Figure: RDF graph example. Nodes: u:Chair, g:Person, g:name, rdfs:Class, rdf:Property, and the literal "John Smith". Edge labels: rdf:type, g:name, rdfs:subClassOf, rdfs:domain.]

<owl:Class rdf:ID="Band">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasMember" />
      <owl:allValuesFrom rdf:resource="#Musician" />
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>

A Band is a subset of the groups which only have Musicians as members

Page 5: Towards Scalable Information Integration with Instance Coreferences

Integrating RDF Sources

AIGP - http://aigp.eecs.umich.edu/
DBLP - http://www.informatik.uni-trier.de/~ley/db/

QUERY: Find all academic papers written by Marvin Minsky’s advisees.

[Figure: two RDF graph fragments. AIGP fragment: nodes aigp:researcher/show/93 and aigp:researcher/show/21, literals “Eugene Charniak” and “Marvin Minsky”, edges aigp:name (twice) and aigp:advisorOf. DBLP fragment: nodes dblp:c/Charniak:Eugene and dblp:jrnl/aim/Charniak97, literals “Eugene Charniak” and “Statistical Techniques for Natural Language Parsing”, edges dblp:name, dblp:hasAuthor and dblp:title. A link marked “=?” joins the two fragments.]
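To make the integration problem concrete, the following is a minimal Python sketch of the query over the two fragments. It is an illustration only, not the system described in these slides. Assumptions: the flattened figure does not show which AIGP URI carries which name, so the sketch assumes researcher/show/21 is Marvin Minsky and researcher/show/93 is Eugene Charniak, and that aigp:advisorOf points from advisor to advisee. The function name papers_by_advisees is hypothetical.

```python
# Toy triples transcribed from the slide's two graph fragments.
# ASSUMPTION: the pairing of AIGP URIs with names and the direction of
# aigp:advisorOf are guesses; the flattened figure does not show them.
TRIPLES = {
    ("aigp:researcher/show/21", "aigp:name", "Marvin Minsky"),
    ("aigp:researcher/show/93", "aigp:name", "Eugene Charniak"),
    ("aigp:researcher/show/21", "aigp:advisorOf", "aigp:researcher/show/93"),
    ("dblp:c/Charniak:Eugene", "dblp:name", "Eugene Charniak"),
    ("dblp:jrnl/aim/Charniak97", "dblp:hasAuthor", "dblp:c/Charniak:Eugene"),
    ("dblp:jrnl/aim/Charniak97", "dblp:title",
     "Statistical Techniques for Natural Language Parsing"),
}


def papers_by_advisees(advisor_name, same_as=()):
    """Return titles of papers written by the advisees of advisor_name.

    same_as is an iterable of (uri, uri) pairs read as owl:sameAs links.
    Only one hop of links is followed here; a real system would need the
    full equivalence closure.
    """
    advisors = {s for s, p, o in TRIPLES
                if p == "aigp:name" and o == advisor_name}
    advisees = {o for s, p, o in TRIPLES
                if p == "aigp:advisorOf" and s in advisors}
    # Add the coreferent URIs of every advisee.
    expanded = set(advisees)
    for a, b in same_as:
        if a in advisees:
            expanded.add(b)
        if b in advisees:
            expanded.add(a)
    papers = {s for s, p, o in TRIPLES
              if p == "dblp:hasAuthor" and o in expanded}
    return sorted(o for s, p, o in TRIPLES
                  if p == "dblp:title" and s in papers)


# Without coreference information the two fragments never join.
without_links = papers_by_advisees("Marvin Minsky")
# With one owl:sameAs link the DBLP paper becomes reachable.
with_links = papers_by_advisees(
    "Marvin Minsky",
    [("aigp:researcher/show/93", "dblp:c/Charniak:Eugene")],
)
```

The only difference between the two calls is the single coreference pair, which is the role the “=?” link plays in the figure.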

Page 6: Towards Scalable Information Integration with Instance Coreferences

Coreference Information

• owl:sameAs
  – states that two URIs denote the same individual
• Linking Open Data initiative
  – ~100 sources with over 4 billion triples (i.e., facts)
  – >100 million explicit owl:sameAs statements
• Many RDF users publish owl:sameAs statements with their data
• Can use automated coreference resolution techniques to find others
  – allow for the possibility of human correction

Page 7: Towards Scalable Information Integration with Instance Coreferences

Scaling

• AIGP and DBLP have about 4000 coreferent instances
• Marvin Minsky has about 20 advisees
• Only a small fragment of coreference information is relevant to any given query
  – Need to be selective about what information to use
• Quantity of coreference information
  – 80K between DBPedia and Geonames
  – 100K between CIA factbook and Geonames

Page 8: Towards Scalable Information Integration with Instance Coreferences

OBII

[Figure: OBII architecture diagram. Components: IndexKB (LAV, GAV; REL statements are LAV + URL of data source), EQKB, GNS, SPARQL Query, Result. Inputs from the Semantic Web Space: domain ontologies (O1 … On), OWLII map ontologies (Om1 … Omn), REL set (R1 … Rn), data sources (S1 … Sn). Phases: "System startup or periodic update" and "Query Phase". Arrow labels: "LAV/GAV", "Rcs and Rps to LAV/GAV", "Rs to Indexed Equivalence closure", "LAV/GAV matches", "is ?", "Get All", "Potentially relevant sources from the leaves", "Potentially relevant sources", "http calls", "Retrieve potentially relevant sources and load them in a reasoner".]

Page 9: Towards Scalable Information Integration with Instance Coreferences

Potential Relevance

• A summary of a source’s content that allows us to ignore sources that cannot possibly contribute to a query

• Unless we look inside the source there is no way to guarantee its relevance

• REL statements take three forms, one for each kind of assertion a source can contain
  (In the following, d is the URL of a data source, Cs is a class, CE is a class expression, Ps and Pq are property names, and {u1, …, un} is a set of URIs)

  – For classes (Rc) the form is REL(d, Cs, CE)

  – For properties (Rp) the form is REL(d, Ps, Pq)

  – For owl:sameAs assertions (R) the form is REL(d, {u1, …, un})
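The three REL forms can be pictured as three small record types plus a filter that discards sources that cannot contribute. The Python sketch below is a simplification under stated assumptions: it matches query terms against REL statements by direct name comparison, whereas the slides describe matching through LAV/GAV rule expansion. The class names, the function potentially_relevant, and the example.org URLs are all hypothetical; the U2, MtnBike and GreenTransport names come from the REL example slide in the backups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassRel:
    """REL(d, Cs, CE): source d has instances of class Cs, contained in CE."""
    source: str
    contained: str
    container: str


@dataclass(frozen=True)
class PropertyRel:
    """REL(d, Ps, Pq): source d has assertions of property Ps, contained in Pq."""
    source: str
    contained: str
    container: str


@dataclass(frozen=True)
class SameAsRel:
    """REL(d, {u1, ..., un}): source d has owl:sameAs assertions about these URIs."""
    source: str
    uris: frozenset


def potentially_relevant(rels, classes=(), properties=(), uris=()):
    """Return URLs of sources that might contribute to a query.

    ASSUMPTION: plain name matching stands in for rule-based matching.
    """
    classes, properties, uris = set(classes), set(properties), set(uris)
    sources = set()
    for rel in rels:
        if isinstance(rel, ClassRel) and rel.container in classes:
            sources.add(rel.source)
        elif isinstance(rel, PropertyRel) and rel.container in properties:
            sources.add(rel.source)
        elif isinstance(rel, SameAsRel) and rel.uris & uris:
            sources.add(rel.source)
    return sources


RELS = [
    ClassRel("http://U2", "http://O1#MtnBike", "http://O1#GreenTransport"),
    # Hypothetical coreference source, for illustration only.
    SameAsRel("http://example.org/links.rdf",
              frozenset({"http://example.org/a#DC",
                         "http://example.org/b#WashingtonDC"})),
]

by_class = potentially_relevant(RELS, classes={"http://O1#GreenTransport"})
by_uri = potentially_relevant(RELS, uris={"http://example.org/a#DC"})
unrelated = potentially_relevant(RELS, classes={"http://O1#Car"})
```

A source that is returned is only potentially relevant: as the slide notes, relevance cannot be guaranteed without looking inside the source.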

Page 10: Towards Scalable Information Integration with Instance Coreferences

Information Integration vs. Source Selection

• Data sources
  – Information Integration: Queryable
  – Source Selection: Lightweight
• Query reformulation process
  – Information Integration: Match and expansion of rules with query atoms
  – Source Selection: Match and expansion of rules with query atoms
• Query reformulation result
  – Information Integration: Conjunctive queries over sources
  – Source Selection: A set of “potentially relevant” sources
• Obtaining the answer
  – Information Integration: Issue the queries and union the results
  – Source Selection: Load the atomic sources into a reasoning engine and issue the original query

Page 11: Towards Scalable Information Integration with Instance Coreferences

Equivalence KB

• Implementation is a variation of disjoint set forest algorithm [Cormen et al. 01]
  – standard operations: union(x,y) and find-set(x)
• Also supports isEquivalent and getAllEquivalent methods
• The index is built by an update algorithm (with a set of seed URLs)
• Uses an inverted document index for equivalence relevance information

[Cormen et al. 01] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. The MIT Press, Cambridge, MA, 2001.
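For readers unfamiliar with disjoint set forests, the Python sketch below shows the textbook structure (union by rank, path compression) extended with the two extra methods named on the slide. It is not the authors' implementation: the slide says only that theirs is "a variation", and the per-root member sets used here to answer get_all_equivalent are an assumption. The update algorithm and the inverted document index are omitted, and the citeseer URI is hypothetical.

```python
class EquivalenceKB:
    """Disjoint set forest over URIs with union by rank and path compression."""

    def __init__(self):
        self.parent = {}   # URI -> parent URI (a root is its own parent)
        self.rank = {}     # root URI -> upper bound on tree height
        self.members = {}  # root URI -> set of all URIs in its class

    def _add(self, x):
        # Unknown URIs are registered as singleton classes.
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0
            self.members[x] = {x}

    def find_set(self, x):
        """Return the representative of x's equivalence class."""
        self._add(x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        """Record that x and y denote the same individual (owl:sameAs)."""
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        self.members[rx] |= self.members.pop(ry)

    def is_equivalent(self, x, y):
        return self.find_set(x) == self.find_set(y)

    def get_all_equivalent(self, x):
        return set(self.members[self.find_set(x)])


kb = EquivalenceKB()
kb.union("aigp:researcher/show/93", "dblp:c/Charniak:Eugene")
# Hypothetical third URI, to show that equivalence is transitive.
kb.union("dblp:c/Charniak:Eugene", "citeseer:charniak-e")
```

Keeping a member set per root makes get_all_equivalent a single lookup after find_set, at the cost of extra memory; other designs are possible.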

Page 12: Towards Scalable Information Integration with Instance Coreferences

Update of EquivalenceKB

Page 13: Towards Scalable Information Integration with Instance Coreferences

Preliminary Tests

• We have used
  – 202,383 owl:sameAs statements that align data from AIGP, DBLP and Citeseer data sources
  – Part of Hawkeye Project
    • http://swat.cse.lehigh.edu/resources/index.html
    • 166 million facts and several “integration resources”
• PC with 3GB
  – EquivalenceKB is 7 MB
  – Buildup time 3 seconds
  – 1000 calls to getAllEquivalent return in less than half a second

Page 14: Towards Scalable Information Integration with Instance Coreferences

Query Answering

Needs equivalence information

Page 15: Towards Scalable Information Integration with Instance Coreferences

GNS Extension

Needs equivalence information

Page 16: Towards Scalable Information Integration with Instance Coreferences

GNS Extension

• contains is used before expansion to avoid cyclic expansion
• To avoid redundancy, we consider syntactic query containment
  – E.g., CONTAINS(cl, P(x,a)) is true if P(x,y) is in cl
• Equivalence information is relevant
  – If author(X, GNS) is in the closed list, we should not expand author(X, GOAL-NODE-SEARCH), assuming GNS = GOAL-NODE-SEARCH
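The containment check can be sketched in a few lines of Python. This is a hedged illustration of syntactic containment over flat atoms, not the authors' code. Assumptions: an atom is a (predicate, arguments) pair, variables are strings starting with "?", and the same argument is a hypothetical stand-in for an equivalence lookup.

```python
def is_var(term):
    # ASSUMPTION: variables are written with a leading "?".
    return term.startswith("?")


def subsumes(general, specific, same):
    """True if `general` syntactically contains `specific`.

    Each variable of `general` must map consistently to one term of
    `specific`; constants must be identical or coreferent under `same`.
    """
    g_pred, g_args = general
    s_pred, s_args = specific
    if g_pred != s_pred or len(g_args) != len(s_args):
        return False
    binding = {}
    for g, s in zip(g_args, s_args):
        if is_var(g):
            if binding.setdefault(g, s) != s:
                return False
        elif is_var(s) or not same(g, s):
            return False
    return True


def contains(closed_list, atom, same=lambda a, b: a == b):
    """CONTAINS(cl, atom): some atom already in the closed list covers `atom`."""
    return any(subsumes(seen, atom, same) for seen in closed_list)


# Slide example: P(x, a) is contained when P(x, y) is in the closed list.
slide_case = contains([("P", ("?x", "?y"))], ("P", ("?x", "a")))

# Coreference example: GNS and GOAL-NODE-SEARCH denote the same thing.
COREF = {frozenset({"GNS", "GOAL-NODE-SEARCH"})}


def same(a, b):
    return a == b or frozenset({a, b}) in COREF


closed = [("author", ("?X", "GNS"))]
atom = ("author", ("?X", "GOAL-NODE-SEARCH"))
plain_check = contains(closed, atom)        # string equality only
coref_check = contains(closed, atom, same)  # coreference-aware
```

With string equality alone the second atom looks new and would be expanded again; the coreference-aware check recognizes it as already covered.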

Page 17: Towards Scalable Information Integration with Instance Coreferences

GNS Extension

• unifyEQ is like regular unify except it accounts for coreferences
• When matching two constants we use isEquivalent of the Equivalence KB
  – livesIn(X, DC) and livesIn(X, WashingtonDC) will not unify unless we know DC = WashingtonDC
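A minimal Python sketch of coreference-aware unification over flat atoms follows. It is an illustration under assumptions, not the authors' unifyEQ: atoms are (predicate, arguments) pairs without nested terms, variables start with "?", and the same function is a hypothetical stand-in for the Equivalence KB lookup.

```python
def is_var(term):
    # ASSUMPTION: variables are written with a leading "?".
    return term.startswith("?")


def walk(term, subst):
    """Follow variable bindings until reaching an unbound variable or constant."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def unify_eq(a, b, same, subst=None):
    """Unify two flat atoms; return a substitution dict, or None on failure.

    Works like ordinary unification, except that two different constants
    still match when `same` reports them as coreferent.
    """
    subst = dict(subst or {})
    (a_pred, a_args), (b_pred, b_args) = a, b
    if a_pred != b_pred or len(a_args) != len(b_args):
        return None
    for s, t in zip(a_args, b_args):
        s, t = walk(s, subst), walk(t, subst)
        if s == t:
            continue
        if is_var(s):
            subst[s] = t
        elif is_var(t):
            subst[t] = s
        elif not same(s, t):
            return None
    return subst


COREF = {frozenset({"DC", "WashingtonDC"})}


def same(x, y):
    return x == y or frozenset({x, y}) in COREF


a = ("livesIn", ("?X", "DC"))
b = ("livesIn", ("?X", "WashingtonDC"))
no_coref = unify_eq(a, b, lambda x, y: x == y)  # fails: constants differ
with_coref = unify_eq(a, b, same)               # succeeds, nothing to bind
bound = unify_eq(a, ("livesIn", ("john", "WashingtonDC")), same)
```

An empty dictionary means the atoms unify without binding any variable, which is different from the None returned on failure.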

Page 18: Towards Scalable Information Integration with Instance Coreferences

Conclusion and Future Work

• Scalable Instance Coreference Handling is an important issue
• Initial work shows promise
• Two important issues
  – Avoid pre-computation of equivalence closure and make the system more dynamic
  – Disk based implementation of EquivalenceKB
• We are currently fine tuning a dynamic algorithm
  – UpdateEqualKB is not seeded with all URIs but rather with URIs from a query
  – Equivalence information is updated as new URIs are discovered due to rule expansion
  – Coming soon to a conference near you

Page 19: Towards Scalable Information Integration with Instance Coreferences

Backups

Page 20: Towards Scalable Information Integration with Instance Coreferences

OWLII in OWL/RDF

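Axiom type, with the constructs allowed as Subject (left-hand side) and Object (right-hand side):

• owl:equivalentClass
  – Subject (left-hand side): Named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue
  – Object (right-hand side): Named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue
• rdfs:subClassOf
  – Subject (left-hand side): All of the above + owl:unionOf
  – Object (right-hand side): All of the above + owl:allValuesFrom
• owl:equivalentProperty, rdfs:subPropertyOf
  – Subject (left-hand side): named properties, owl:inverseOf
  – Object (right-hand side): named properties, owl:inverseOf
• owl:inverseOf
  – Subject (left-hand side): named properties
  – Object (right-hand side): named properties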

Page 21: Towards Scalable Information Integration with Instance Coreferences

Map example

O1:GreenTransport(X) :- O2:Transport(X), O2:greenRating(X, good)

<owl:Class rdf:about="http://O1#GreenTransport">
  <rdfs:subClassOf rdf:resource="http://O2#Transport"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="http://O2#greenRating"/>
      <owl:hasValue rdf:resource="http://uri#good"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>

Page 22: Towards Scalable Information Integration with Instance Coreferences

REL example

<meta:RelStatement>
  <meta:source rdf:resource="http://U2"/>
  <meta:contained>
    <owl:Class rdf:about="http://O1#MtnBike"/>
  </meta:contained>
  <meta:container>
    <owl:Class rdf:about="http://O1#GreenTransport"/>
  </meta:container>
</meta:RelStatement>

R4: O1:MtnBike(X) ⊑ O1:GreenTransport(X), U2