schemex -- building an index for linked open data

SchemEX – Building an Index for Linked Open Data. Ansgar Scherp, Thomas Gottron, Mathias Konrath, University of Koblenz-Landau, Germany. Oslo, Norway, August 2012.

Upload: ansgar-scherp

Post on 10-May-2015

DESCRIPTION

General introduction to Linked Open Data and schema extraction using SchemEX. Download full slide set to enjoy all animations.

TRANSCRIPT

Page 1: SchemEX -- Building an Index for Linked Open Data

Slide 1 of 44 | SchemEX – Building an Index for LOD

SchemEX – Building an Index for Linked Open Data

Ansgar Scherp, Thomas Gottron, Mathias Konrath
University of Koblenz-Landau, Germany

Oslo, Norway
August 2012

Page 2: SchemEX -- Building an Index for Linked Open Data

Learning Goals

• Understand the motivation and fundamentals of Linked Open Data (LOD).

• Explain why an index for LOD is needed and how to create such an index efficiently.

Page 3: SchemEX -- Building an Index for Linked Open Data

Scenario

• Tim plans to travel
– from London
– to a customer in Cologne

Page 4: SchemEX -- Building an Index for Linked Open Data

Website of the German Railway

It works, why bother…?

Eurostar

DB

KVB

Page 5: SchemEX -- Building an Index for Linked Open Data

Let's Try Different Queries

• Bottlenecks in public transportation?
• Compare the connections with flights?
• Visualize on a map?
• …

All these queries cannot be answered, because the data …

Page 6: SchemEX -- Building an Index for Linked Open Data

… is locked in silos!

– High integration effort
– Lack of data reuse

B. Jagendorf, http://www.flickr.com/photos/bobjagendorf/, CC-BY

Page 7: SchemEX -- Building an Index for Linked Open Data

Linked Data

• Publishing and interlinking of data
• Different quality and purpose
• From different sources in the Web

World Wide Web      Linked Data
Documents           Data
Hyperlinks          Typed Links
HTML                RDF
Addresses (URIs)    Addresses (URIs)

Example: http://www.uio.no/

Page 8: SchemEX -- Building an Index for Linked Open Data

Relevance of Linked Data?

Page 9: SchemEX -- Building an Index for Linked Open Data

Linked Data: May '07 / Sept. '11

[Figure: LOD cloud diagram, < 31 billion triples. Domains: Media, Geographic, Publications, Web 2.0, eGovernment, Cross-Domain, Life Sciences]

Source: http://lod-cloud.net

Page 10: SchemEX -- Building an Index for Linked Open Data

Linked Data Principles

1. Identification
2. Interlinkage
3. Dereferencing
4. Description

Page 11: SchemEX -- Building an Index for Linked Open Data

Example: Big Lynx

Big Lynx Company

Matt Briggs

Scott Miller

?

Page 12: SchemEX -- Building an Index for Linked Open Data

1. Use URIs for Identification

Matt Briggs: http://biglynx.co.uk/people/matt-briggs

Scott Miller: http://biglynx.co.uk/people/scott-miller

B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY

Page 13: SchemEX -- Building an Index for Linked Open Data

Example: Big Lynx

Big Lynx Company

Matt Briggs

Scott Miller

How to model relationships like knows?

Page 14: SchemEX -- Building an Index for Linked Open Data

Resource Description Framework (RDF)

• Description of resources with RDF triples

Subject Predicate Object

Example: Matt Briggs is a Person

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://biglynx.co.uk/people/matt-briggs> rdf:type foaf:Person .
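
In code, a triple is just a three-part record. The following minimal Python sketch rebuilds the slide's statement about Matt Briggs. The `Triple` type and the `expand` helper are hypothetical names chosen for illustration and do not belong to any RDF library.

```python
from typing import NamedTuple

# Prefix table mirroring the @prefix declarations on the slide.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


class Triple(NamedTuple):
    """Illustrative container for one RDF statement (hypothetical type)."""
    subject: str
    predicate: str
    obj: str


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'foaf:Person' into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local


# "Matt Briggs is a Person" as subject, predicate, object.
matt_is_person = Triple(
    "http://biglynx.co.uk/people/matt-briggs",
    expand("rdf:type"),
    expand("foaf:Person"),
)
```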

Page 15: SchemEX -- Building an Index for Linked Open Data

1. Use URIs also for Relations

B. Gazen, http://www.flickr.com/photos/bayat/, CC-BY

http://biglynx.co.uk/people/matt-briggs

http://biglynx.co.uk/people/scott-miller

foaf:knows

foaf:knows

Page 16: SchemEX -- Building an Index for Linked Open Data

Example: Big Lynx

[Figure: three data sources and the links between them]

• Big Lynx Company: Matt Briggs, Scott Miller, Dave Smith
• Matt's private website: Matt Briggs ("same person" as Matt Briggs at Big Lynx)
• DBpedia: London (Dave Smith "lives here")

Page 17: SchemEX -- Building an Index for Linked Open Data

2. Establishing Interlinkage

• Relation links between resources

<http://biglynx.co.uk/people/dave-smith> foaf:based_near <http://dbpedia.org/resource/London> .

• Identity links between resources

<http://biglynx.co.uk/people/matt-briggs> owl:sameAs <http://www.matt-briggs.eg.uk#me> .

Page 18: SchemEX -- Building an Index for Linked Open Data

Example: Big Lynx

[Figure: the same three data sources, now with typed links]

• Big Lynx Company: Matt Briggs, Dave Smith
• Matt's private website: Matt Briggs
• DBpedia: London

• owl:sameAs links the two Matt Briggs resources ("same person")
• foaf:based_near links Dave Smith to London ("lives here")

Page 19: SchemEX -- Building an Index for Linked Open Data

3. Dereferencing of URIs

• Looking up web documents

• How can we “look up” things of the real world?

http://biglynx.co.uk/people/matt-briggs

Page 20: SchemEX -- Building an Index for Linked Open Data

Two Approaches

1. Hash URIs
– URI contains a part separated by #, e.g.,
http://biglynx.co.uk/vocab/sme#Team

2. Negotiation via a "303 See Other" response
Request: http://biglynx.co.uk/people/matt-briggs
Response: "Look here:" http://biglynx.co.uk/people/matt-briggs.rdf
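
The two approaches can be contrasted in a small Python sketch. It performs no network access: the `SEE_OTHER` table is a hypothetical stand-in for a server's 303 answers, and `document_for` is an illustrative helper name. Only `urllib.parse.urldefrag` is a real standard-library call.

```python
from urllib.parse import urldefrag

# Hypothetical redirect table standing in for a server's "303 See Other"
# responses; a real client would learn this from the HTTP Location header.
SEE_OTHER = {
    "http://biglynx.co.uk/people/matt-briggs":
        "http://biglynx.co.uk/people/matt-briggs.rdf",
}


def document_for(uri: str) -> str:
    """Return the URL of the document that describes `uri`."""
    document_url, fragment = urldefrag(uri)
    if fragment:
        # Hash URI: the part after '#' is dropped before the lookup,
        # so the enclosing document is retrieved.
        return document_url
    # 303 URI: the server names the describing document in its response.
    return SEE_OTHER.get(uri, uri)
```

For the slide's examples, `document_for("http://biglynx.co.uk/vocab/sme#Team")` yields the vocabulary document, while the Matt Briggs URI resolves through the redirect table.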

Page 21: SchemEX -- Building an Index for Linked Open Data

Example: Big Lynx

[Figure: the three data sources with typed links]

• Big Lynx Company: Matt Briggs, Dave Smith
• Matt's private website: Matt Briggs
• DBpedia: London
• Links: owl:sameAs, foaf:based_near

Description of Matt?

Page 22: SchemEX -- Building an Index for Linked Open Data

4. Description of URIs

[Figure: RDF graph describing biglynx:matt-briggs]

• Nodes: biglynx:matt-briggs, foaf:Person, biglynx:dave-smith, dp:Birmingham, dp:London, _:point, “-0.118”, “51.509”, …
• Edge labels: rdf:type, foaf:knows, foaf:based_near (twice), ex:loc, wgs84:lat, wgs84:long

Page 23: SchemEX -- Building an Index for Linked Open Data

RDF / RDF Schema Vocabulary

• Set of URIs defined in the rdf: and rdfs: namespaces

• rdf:type, rdf:Property, rdf:XMLLiteral, rdf:List, rdf:first, rdf:rest, rdf:Seq, rdf:Bag, rdf:Alt, ..., rdf:value

• rdfs:domain, rdfs:range, rdfs:Resource, rdfs:Literal, rdfs:Datatype, rdfs:Class, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:comment, …, rdfs:label

Page 24: SchemEX -- Building an Index for Linked Open Data

Semantic Web Layer Cake (Simplified)

Page 25: SchemEX -- Building an Index for Linked Open Data

Learning Goals

• Understand the motivation and fundamentals of Linked Open Data (LOD).

• Explain why an index for LOD is needed and how to create such an index efficiently.

Page 26: SchemEX -- Building an Index for Linked Open Data

Scenario

• People who are politicians and actors
• Who else?
• Where do they live?
• Whom do they know?
• … are they married to?

Page 27: SchemEX -- Building an Index for Linked Open Data

Problem

• Execute those queries on the LOD cloud
• No single federated query interface provided
• Relevant sources?

“politicians and actors”

SELECT ?x
FROM …
WHERE {
  ?x rdf:type ex:Actor .
  ?x rdf:type ex:Politician .
}
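
To make the query pattern concrete, here is a minimal Python sketch that evaluates "has both types" over triples already held in memory. It deliberately sidesteps the real problem on this slide, namely finding which sources to load. The function name and the sample resources `ex:a` and `ex:b` are made up for illustration.

```python
RDF_TYPE = "rdf:type"  # prefixed names are kept as plain strings for brevity


def instances_with_all_types(triples, wanted_types):
    """Return the subjects that carry every type in `wanted_types`."""
    types_by_subject = {}
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types_by_subject.setdefault(subject, set()).add(obj)
    wanted = set(wanted_types)
    # A subject matches when the wanted types are a subset of its types.
    return {s for s, types in types_by_subject.items() if wanted <= types}


# Made-up sample data, for illustration only.
triples = [
    ("ex:a", RDF_TYPE, "ex:Actor"),
    ("ex:a", RDF_TYPE, "ex:Politician"),
    ("ex:b", RDF_TYPE, "ex:Actor"),
]
both = instances_with_all_types(triples, ["ex:Actor", "ex:Politician"])
```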

Page 28: SchemEX -- Building an Index for Linked Open Data

Solution in Principle

• Suitable index structure for looking up sources

“politicians and actors”

Page 29: SchemEX -- Building an Index for Linked Open Data

The Naive Approach

1. Download the entire LOD cloud
2. Put it into a (really) large triple store
3. Process the data and extract the schema
4. Provide lookup

– Big machinery
– Late in processing the data
– High effort to scale with the LOD cloud

Can we do this smarter?

Page 30: SchemEX -- Building an Index for Linked Open Data

Idea

Schema-level index
– Define families of graph patterns
– Assign instances to graph patterns
– Map graph patterns to context (source URI)

Construction
– Stream-based for scalability
– Little loss of accuracy

Note
– Index defined over instances
– But stores the context

Page 31: SchemEX -- Building an Index for Linked Open Data

Input Data

N-Quads: <subject> <predicate> <object> <context>

Example:
<http://www.w3.org/People/Connolly/#me>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person>
<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> .

[Figure: w3p:#me is typed as foaf:Person; the context is http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf]
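
Splitting such a line into its four parts can be sketched in a few lines of Python. This is a simplification that only accepts quads made of four IRIs; literals and blank nodes, which the N-Quads format also allows, are not handled. The helper name `parse_quad` is hypothetical and unrelated to the NxParser mentioned later in the talk.

```python
import re

# Four IRIs in angle brackets, optionally followed by the terminating dot.
# Assumption: no literals or blank nodes appear in the line.
QUAD = re.compile(
    r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.?\s*$"
)


def parse_quad(line: str):
    """Split one IRI-only N-Quads line into (subject, predicate, object, context)."""
    match = QUAD.match(line)
    if match is None:
        raise ValueError(f"not an IRI-only quad: {line!r}")
    return match.groups()


EXAMPLE = (
    "<http://www.w3.org/People/Connolly/#me> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> ."
)
subject, predicate, obj, context = parse_quad(EXAMPLE)
```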

Page 32: SchemEX -- Building an Index for Linked Open Data

Building the Schema and Index

[Figure: layered structure of the schema-level index]

• Data sources: DS1, DS2, DS3, DS4, DS5, …, DSx
• Equivalence classes: EQC1, EQC2, …, EQCn
• Type clusters: TC1, TC2, …, TCm
• RDF classes: C1, C2, C3, …, Ck
• Edge labels: consistsOf, hasDataSource, hasEQClass, p1, p2

Page 33: SchemEX -- Building an Index for Linked Open Data

Layer 1: RDF Classes

[Figure: timbl:card#i is an instance of foaf:Person, found in http://www.w3.org/People/Berners-Lee/card and http://dig.csail.mit.edu/2008/...; the class foaf:Person (C1) maps to the data sources DS 1, DS 2, DS 3]

SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
}

All instances of a particular type
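
A minimal Python sketch of this first layer maps each RDF class to the contexts (data sources) in which instances of that class were seen. It illustrates the idea only and is not the SchemEX implementation; `build_class_index` is a hypothetical name, and prefixed names are kept as plain strings.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def build_class_index(quads):
    """Map each class to the set of contexts containing an instance of it."""
    index = defaultdict(set)
    for subject, predicate, obj, context in quads:
        if predicate == RDF_TYPE:
            index[obj].add(context)
    return index


# Quads taken from the slides' examples.
quads = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person",
     "http://www.w3.org/People/Berners-Lee/card"),
    ("w3p:#me", RDF_TYPE, "foaf:Person",
     "http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf"),
]
class_index = build_class_index(quads)
```

Looking up `class_index["foaf:Person"]` then answers the query above at the source level: it names the documents worth fetching, not the instances themselves.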

Page 34: SchemEX -- Building an Index for Linked Open Data

Layer 2: Type Clusters

All instances belonging to exactly the same set of types

SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
}

[Figure: timbl:card#i, found in http://www.w3.org/People/Berners-Lee/card, has the types foaf:Person (C1) and pim:Male (C2); both classes form the type cluster tc4711 (TC1), which maps to the data sources DS 1, DS 2, DS 3]
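
As a rough Python sketch of this layer, a type cluster can be keyed by the frozen set of all types of an instance. The function name and the second sample instance `ex:someone` are assumptions made for illustration; the batch-style, two-pass processing is a simplification and not the stream-based construction described later.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def build_type_clusters(quads):
    """Map each exact set of types to the contexts of its instances."""
    types = defaultdict(set)      # instance -> all of its types
    contexts = defaultdict(set)   # instance -> contexts it was seen in
    for subject, predicate, obj, context in quads:
        if predicate == RDF_TYPE:
            types[subject].add(obj)
            contexts[subject].add(context)

    clusters = defaultdict(set)   # frozenset of types -> contexts
    for instance, type_set in types.items():
        clusters[frozenset(type_set)] |= contexts[instance]
    return clusters


CARD = "http://www.w3.org/People/Berners-Lee/card"
quads = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person", CARD),
    ("timbl:card#i", RDF_TYPE, "pim:Male", CARD),
    # Made-up instance with only one type, for contrast.
    ("ex:someone", RDF_TYPE, "foaf:Person", "http://example.org/doc"),
]
clusters = build_type_clusters(quads)
```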

Page 35: SchemEX -- Building an Index for Linked Open Data

Layer 3: Equivalence Classes

Two instances are equivalent iff:
• They are in the same TC
• They have the same properties
• The property targets are in the same TC

Similar to 1-bisimulation

[Figure: equivalence class EQC1 connects type cluster TC1 (classes C1, C2) via property p to type cluster TC2 (class C3) and maps to the data sources DS 1, DS 2, DS 3]

Page 36: SchemEX -- Building an Index for Linked Open Data

Layer 3: Equivalence Classes

[Figure: timbl:card#i (types foaf:Person and pim:Male, type cluster tc4711) and timbl:card (type foaf:PPD, type cluster tc1234) are connected by foaf:maker; this yields the equivalence class eqc0815 with the link eqc0815-maker-tc1234; source: http://www.w3.org/People/Berners-Lee/card]

SELECT ?x
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
  ?x foaf:maker ?y .
  ?y rdf:type foaf:PersonalProfileDocument .
}
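
The equivalence rule (same type cluster, same properties, property targets in the same type cluster) can be sketched in Python by building one key per instance. This is an illustrative batch version under simplifying assumptions, not the authors' code: all quads fit in memory, only instances that occur as subjects are indexed, and the sample data is made up.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def equivalence_classes(quads):
    """Map (type cluster, {(property, target type cluster)}) to contexts."""
    types = defaultdict(set)
    for subject, predicate, obj, _ in quads:
        if predicate == RDF_TYPE:
            types[subject].add(obj)

    def type_cluster(node):
        # Untyped nodes fall into the empty type cluster.
        return frozenset(types.get(node, ()))

    links = defaultdict(set)      # instance -> {(property, target cluster)}
    contexts = defaultdict(set)   # instance -> contexts it was seen in
    for subject, predicate, obj, context in quads:
        contexts[subject].add(context)
        if predicate != RDF_TYPE:
            links[subject].add((predicate, type_cluster(obj)))

    index = defaultdict(set)
    for instance in contexts:
        key = (type_cluster(instance), frozenset(links[instance]))
        index[key] |= contexts[instance]
    return index


# Made-up sample: two persons, each the subject of foaf:maker to a profile document.
quads = [
    ("ex:a", RDF_TYPE, "foaf:Person", "ex:doc1"),
    ("ex:a", "foaf:maker", "ex:profileA", "ex:doc1"),
    ("ex:profileA", RDF_TYPE, "foaf:PersonalProfileDocument", "ex:doc1"),
    ("ex:b", RDF_TYPE, "foaf:Person", "ex:doc2"),
    ("ex:b", "foaf:maker", "ex:profileB", "ex:doc2"),
    ("ex:profileB", RDF_TYPE, "foaf:PersonalProfileDocument", "ex:doc2"),
]
eqc_index = equivalence_classes(quads)
person_key = (
    frozenset({"foaf:Person"}),
    frozenset({("foaf:maker", frozenset({"foaf:PersonalProfileDocument"}))}),
)
```

Here `ex:a` and `ex:b` end up under the same key, so a query shaped like the one above would be pointed to both documents.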

Page 37: SchemEX -- Building an Index for Linked Open Data

The SchemEX Approach

• Stream-based schema extraction
• While crawling the data

[Figure: processing pipeline. Components shown: LOD-Crawler, RDF-Dump and Triple Store as inputs; Parser (NxParser); Nquad-Stream; Schema-Extractor with Instance-Cache (FIFO); Schema-Level Index (RDF, RDBMS)]

Page 38: SchemEX -- Building an Index for Linked Open Data

Building the Index from a Stream

• Stream of n-quads (coming from an LD crawler)

… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1

[Figure: quads from the stream fill a FIFO instance cache; cached instances are assigned to the classes C1, C2, C3]

• Linear runtime complexity w.r.t. the number of input triples
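
The following Python sketch shows one way a bounded FIFO cache could drive such a single-pass construction. It is an assumption-laden illustration rather than the SchemEX code: it builds type clusters only, `stream_type_clusters` is a hypothetical name, and the sample stream is made up. Each quad is touched once, which is where the linear runtime comes from. When an instance is evicted before all of its quads have arrived, its types are split across several entries, which illustrates how a small cache can cost accuracy.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "rdf:type"


def stream_type_clusters(quads, cache_size):
    """Single pass over `quads` with at most `cache_size` cached instances."""
    if cache_size < 1:
        raise ValueError("cache_size must be at least 1")
    index = defaultdict(set)  # type cluster -> contexts
    cache = OrderedDict()     # instance -> (types, contexts), oldest first

    def flush(types, contexts):
        if types:
            index[frozenset(types)] |= contexts

    for subject, predicate, obj, context in quads:
        if predicate != RDF_TYPE:
            continue
        if subject not in cache:
            if len(cache) >= cache_size:
                # FIFO eviction: write the oldest instance to the index.
                _, (old_types, old_contexts) = cache.popitem(last=False)
                flush(old_types, old_contexts)
            cache[subject] = (set(), set())
        types, contexts = cache[subject]
        types.add(obj)
        contexts.add(context)

    # End of stream: write out whatever is still cached.
    for types, contexts in cache.values():
        flush(types, contexts)
    return index


# Made-up stream in which the quads of ex:a are interrupted by ex:b.
stream = [
    ("ex:a", RDF_TYPE, "foaf:Person", "ex:doc1"),
    ("ex:b", RDF_TYPE, "foaf:Person", "ex:doc2"),
    ("ex:a", RDF_TYPE, "pim:Male", "ex:doc1"),
]
large_cache = stream_type_clusters(stream, cache_size=2)
small_cache = stream_type_clusters(stream, cache_size=1)
```

With room for two instances, `ex:a` is indexed under its full type set; with room for one, it is evicted early and its two types land in separate clusters.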

Page 39: SchemEX -- Building an Index for Linked Open Data

Computing SchemEX: TimBL Data Set

• Analysis of a smaller data set
• 11 M triples, TimBL’s FOAF profile
• LDspider with ~ 2k triples / sec
• Different cache sizes: 100, 1k, 10k, 50k, 100k
• Compared SchemEX with reference schema
• Index queries on all Types, TCs, EQCs
• Good precision/recall ratio at 50k+
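
Precision and recall for a single index query can be computed as in the Python sketch below. The slides do not spell out the exact evaluation protocol, so comparing the set of data sources returned by the stream-built index against the set from the reference schema is an assumption, and the sample sets are made up.

```python
def precision_recall(retrieved: set, relevant: set):
    """Precision and recall of `retrieved` measured against `relevant`."""
    hits = len(retrieved & relevant)
    # Convention chosen here: an empty set counts as perfect on that measure.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Made-up example: sources returned by the stream-built index vs. the reference.
from_stream_index = {"ex:doc1", "ex:doc2", "ex:doc3"}
from_reference = {"ex:doc1", "ex:doc2", "ex:doc4", "ex:doc5"}
precision, recall = precision_recall(from_stream_index, from_reference)
```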

Page 40: SchemEX -- Building an Index for Linked Open Data

Quality of Stream-based Index Construction

• Runtime hardly increases with window size
• Memory consumption scales with window size

Page 41: SchemEX -- Building an Index for Linked Open Data

Computing SchemEX: Full BTC 2011 Data

Cache size: 50 k

Page 42: SchemEX -- Building an Index for Linked Open Data

Billion Triple Challenge 2011

Page 43: SchemEX -- Building an Index for Linked Open Data

Conclusions: SchemEX

• Linked Open Data (LOD) approach
– Publishing and interlinking data on the web

• SchemEX
– Stream-based approach to LOD schema extraction
– Scalable to arbitrary amounts of Linked Data
– Applicable on commodity hardware (4 GB RAM, single CPU)

Page 44: SchemEX -- Building an Index for Linked Open Data

Learning Goals

• Understand the motivation and fundamentals of Linked Open Data (LOD).

• Explain why an index for LOD is needed and how to create such an index efficiently.

Page 45: SchemEX -- Building an Index for Linked Open Data

Recommended Readings

• Maciej Janik, Ansgar Scherp, Steffen Staab: The Semantic Web: Collective Intelligence on the Web. Informatik Spektrum 34(5): 469-483 (2011). URL: http://dx.doi.org/10.1007/s00287-011-0535-x

• Mathias Konrath, Thomas Gottron, Steffen Staab, Ansgar Scherp: SchemEX – Efficient construction of a data catalogue by stream-based indexing of linked data. J. of Web Semantics: Science, Services and Agents on the World Wide Web, available online 23 June 2012. URL: http://www.sciencedirect.com/science/article/pii/S1570826812000716

• Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool Publishers, 2011. URL: http://dx.doi.org/10.2200/S00334ED1V01Y201102WBE001