Using Semantic Web Technology to Integrate Scientific Data

Alasdair J G Gray
University of Manchester

A.J.G. Gray - Bolzano-Bozen Seminar 2

Outline

• Motivation: Astronomy & The Virtual Observatory
• Data Integration Challenges

1. Locating the relevant data

2. Requesting the required data

3. Understanding the returned data

23 June 2009

Using Semantic Web tools and technology to overcome these challenges

Context: Astronomy

• Data collected across electromagnetic spectrum

• Analysed within one wavelength

• Data collection is
– expensive
– time consuming

• Existing data
– large quantities
– freely available

Image: Wikipedia

Virtual Observatory

“facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.”

Searching for Brown Dwarfs

• Data sets:
– Near Infrared, 2MASS/UK Infrared Deep Sky Survey
– Optical, APMCAT/Sloan Digital Sky Survey

• Complex colour/motion selection criteria
• Similar problems
– Halo White Dwarfs

Image: AstroGrid

Deep Field Surveys

• Observations in multiple wavelengths
– Radio to X-Ray

• Searching for new objects
– Galaxies, stars, etc

• Requires correlations across many catalogues
– ISO
– Hubble
– SCUBA
– etc

Virtual Observatory: The Problems

Locate, retrieve, and interpret relevant data

• Heterogeneous publishers
– Archive centres
– Research labs

• Heterogeneous data
– Relational
– XML
– Image Files

Virtual Observatory: The Problems

Locate, retrieve, and interpret relevant data

1. Which data sources contain relevant data?

2. How do I query the relevant data sources?

3. How can I interpret/combine/analyse the data?

Finding relevant data sources

1. Which data sources contain relevant data?

Which data sources do I use?

• VO registry
– 65,000+ entries
– Many mirrored services

• VOExplorer
– Registry search tool

• Resources tagged with keywords
– e.g. 6df, survey, galaxy, galaxies, redshift, redshifts, 2mass

Analysis of Registry Keywords

Problems:
– Plural/singular
– Case
– Abbreviations
– Different tags
– Specificity of tags

Keyword counts in the registry:

75  Star
52  Galaxy
37  Stars
36  Galaxies
16  AGN
12  Cluster of Galaxies
12  Nebulae
11  Planets
10  GRB
10  Globular Clusters
 8  Star Cluster
 7  Nebula
 6  Variable stars
 5  Hot stars
 5  Pulsar
 4  supernova
 3  Clusters of Galaxies
 3  Infrared:stars
 3  Quasars: general
 3  Supernova
 3  White dwarfs
 3  galaxies
 2  Comets
 2  Cool stars
 2  Extragalactic Source
 2  Extragalactic objects
 2  Infrared: stars
 2  Interstellar medium
 2  QSO
 2  QSOs
 2  SNR
 2  Variable Star
 2  White Dwarf
 2  clusters of galaxies
 2  stars
 1  Asteroids
 1  BL Lac
 1  Be/X-ray binary stars
 1  Binary stars
 ...

Thanks to Sébastien Derriere for this data.

Analysis of Registry Keywords

Problems:
– Plural/singular
– Case

Solution (standard IR techniques):
– Stemming
  • Star & Stars become Star
– Case normalisation
  • lowercase
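The stemming and case-normalisation step can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the talk: `naive_stem` is a hypothetical plural-stripping rule (a real system would more likely use an established stemmer such as Porter's), and the counts are a handful of entries copied from the keyword list.

```python
from collections import Counter

def naive_stem(word: str) -> str:
    """Very rough plural stripping; an assumption standing in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # galaxies -> galaxy
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                # stars -> star
    return word

def normalise(tag: str) -> str:
    """Lowercase the tag, then stem each word."""
    return " ".join(naive_stem(w) for w in tag.lower().split())

# A few raw keyword counts taken from the registry analysis.
raw_counts = {
    "Star": 75, "Galaxy": 52, "Stars": 37, "Galaxies": 36,
    "galaxies": 3, "stars": 2,
}

merged = Counter()
for tag, count in raw_counts.items():
    merged[normalise(tag)] += count

print(merged)
```

With these rules the three spellings of "star" collapse into one key, as do the three spellings of "galaxy". Pairs such as "Nebula"/"Nebulae" or "QSO"/"Quasars" are not merged, which is exactly the gap the semantic approach is meant to close.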


Analysis of Registry Keywords

Problems:
– Abbreviations
– Different tags
– Specificity of tags

Solution:
Need to understand semantics!


Semantic Options

• Folksonomies
– Keyword tags, freely chosen

• Vocabulary
– Controlled list of words with definitions

• Taxonomy
– Relationships: Broader/Narrower/Related

• Thesaurus
– Synonyms, antonyms, see also

• Ontology
– Formal specification of a shared conceptualisation
– OWL

“Vocabulary” used to cover vocabularies, taxonomies, and thesauri.

Image: Leonard Cohen Search

What is a Vocabulary?

What is a Controlled Vocabulary?

• A set of terms with:
– Label
– Synonyms
– Definition
– Relationships to other terms:
  • Broader term
  • Narrower term
  • Related term

• Example:
– “Spiral galaxy”
– “Spiral nebula”
– “A galaxy having a spiral structure”
– Relationships carrying semantic information:
  • BT: “Galaxy”
  • NT: “Barred spiral galaxy”
  • RT: “Spiral arm”

Existing Vocabularies in Astronomy

• Journal Keywords
– Developed for tagging papers
– 311 terms
– Actively used

• Astronomy Visualization Metadata (AVM)
– Tagging images
– 217 terms
– Actively used

• IAU Thesaurus
– Developed for libraries in 1993
– 2,551 terms
– Never really used

• Unified Content Descriptor (UCD)
– Tagging resource data
– 473 terms
– Actively used

Common Vocabulary Format

Requirements:
– Provide term identifiers
  • Unambiguous tagging
– Capture semantic relationships
  • Poly-hierarchy structure
– Machine processable
  • Allows inter-operability
  • “Machine intelligence”

– Avoids problems of:
  • Spelling
  • Case
  • Plurality
  • Tags
– Automated reasoning:
  • Interested in all “Supernova”
  • Items tagged as “1a Supernova” also returned
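The "Supernova" example of automated reasoning amounts to expanding a query concept with everything reachable through narrower-term links. A minimal sketch follows; the concept identifiers, the `NARROWER` table and the tagged resources are all hypothetical and chosen only to mirror the example.

```python
from collections import deque

# Hypothetical narrower-term links (identifiers are illustrative only).
NARROWER = {
    "supernova": ["type1aSupernova", "type2Supernova"],
    "type1aSupernova": [],
    "type2Supernova": [],
    "galaxy": ["spiralGalaxy"],
    "spiralGalaxy": [],
}

# Hypothetical resources, each tagged with one concept identifier.
TAGGED = {
    "resourceA": "type1aSupernova",
    "resourceB": "supernova",
    "resourceC": "galaxy",
}

def expand(concept, narrower):
    """Return the concept plus every concept reachable via narrower links."""
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in narrower.get(current, []):
            if child not in seen:  # a poly-hierarchy can reach a term twice
                seen.add(child)
                queue.append(child)
    return seen

def search(concept):
    """Return resources tagged with the concept or any narrower concept."""
    wanted = expand(concept, NARROWER)
    return sorted(r for r, tag in TAGGED.items() if tag in wanted)

print(search("supernova"))
```

The `seen` set matters because a poly-hierarchy lets one term sit under several broader terms, so the same concept can be reached by more than one path.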

SKOS

– W3C standard for sharing vocabularies
– Based on RDF
• Semantic model for describing resources
– Provides URI for each term
– Captures properties of terms
– Encodes relationships between terms
• Enables automated reasoning
• Standard serialisations
• “Looser” semantics than OWL
– Adopted by IVOA as a standard for vocabularies

Example SKOS Vocabulary Term

Example
“Spiral galaxy”
“Spiral nebula”
“A galaxy having a spiral structure”
Relationships:
BT: “Galaxy”
NT: “Barred spiral galaxy”
RT: “Spiral arm”

In Turtle notation

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<#spiralGalaxy> a skos:Concept ;
    skos:prefLabel "Spiral galaxy"@en ;
    skos:altLabel "Spiral nebula"@en ;
    skos:definition "A galaxy having a spiral structure"@en ;
    skos:broader <#galaxy> ;
    skos:narrower <#barredSpiralGalaxy> ;
    skos:related <#spiralArm> .

Inter-operable Vocabularies

Which vocabulary should I use?

• One that you know!
• Closest match to your needs
• Vocabulary terms related using mappings
– Part of the SKOS standard
– One mapping file per pair of vocabularies

Inter-vocabulary mappings

• Broad match: more general term
• Narrow match: more specific term
• Related match: associated term
• Exact match: equivalent term
• Close match: similar but not equivalent term
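One way to picture how a mapping file is used: treat it as a list of (source term, match type, target term) entries and choose which match types are acceptable when translating a term into another vocabulary. The sketch below assumes exactly that; the `vocabA:`/`vocabB:` identifiers and the `MAPPINGS` contents are invented for illustration and do not come from any real mapping file.

```python
# Hypothetical mapping entries between two vocabularies.
MAPPINGS = [
    ("vocabA:spiralGalaxy", "exactMatch", "vocabB:SpiralGalaxy"),
    ("vocabA:spiralGalaxy", "broadMatch", "vocabB:Galaxy"),
    ("vocabA:spiralArm", "relatedMatch", "vocabB:SpiralGalaxy"),
    ("vocabA:quasar", "closeMatch", "vocabB:QSO"),
]

def targets(concept, mappings, accepted=("exactMatch", "closeMatch")):
    """Return target terms mapped from `concept` by an accepted match type."""
    return [t for s, kind, t in mappings if s == concept and kind in accepted]

# Strict translation: only equivalent or near-equivalent terms.
print(targets("vocabA:spiralGalaxy", MAPPINGS))

# Looser translation: also accept a more general term in the other vocabulary.
print(targets("vocabA:spiralGalaxy", MAPPINGS,
              accepted=("exactMatch", "closeMatch", "broadMatch")))
```

Keeping the match type explicit lets a search tool decide how much precision to trade for recall, for example by ranking exact matches above broad or related ones.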

Mapping Editor

Putting it all together

• Use vocabulary concepts for
– Tagging (using URI)
  • Resources in the registry
  • VOEvent packets
– Searching by vocabulary concept
  • User keyword search converted to vocabulary URI

• Provides semantic advantages
– Reasoning about terms
  • Relationships (Intra-vocabulary)
  • Mappings (Inter-vocabulary)

• Requires a mechanism to convert a string to a concept
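The simplest possible string-to-concept mechanism is an index from normalised labels to concept URIs. The sketch below assumes that approach; it does exact label matching only, whereas the Vocabulary Explorer described next relies on Information Retrieval matching and ranking. The `CONCEPTS` table reuses the "Spiral galaxy" example; its structure and the helper names are assumptions.

```python
# Concepts with their preferred and alternative labels (from the SKOS example).
CONCEPTS = {
    "#spiralGalaxy": {"prefLabel": "Spiral galaxy", "altLabels": ["Spiral nebula"]},
    "#galaxy": {"prefLabel": "Galaxy", "altLabels": []},
}

def build_index(concepts):
    """Map each case-folded label to the set of concept URIs that carry it."""
    index = {}
    for uri, concept in concepts.items():
        for label in [concept["prefLabel"], *concept["altLabels"]]:
            index.setdefault(label.casefold(), set()).add(uri)
    return index

def lookup(keyword, index):
    """Return the concept URIs whose label equals the keyword, ignoring case."""
    return sorted(index.get(keyword.strip().casefold(), set()))

index = build_index(CONCEPTS)
print(lookup("spiral NEBULA", index))
print(lookup("barred spiral", index))
```

A keyword that matches no label returns an empty list, which is where partial matching and ranking become necessary.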

Vocabulary Explorer

• Search and browse vocabularies
– Configure
  • Vocabularies
  • Mappings

• Based on Information Retrieval techniques
• Matching mechanisms
• Ranking results

http://explicator.dcs.gla.ac.uk/WebVocabularyExplorer/

Search Results

Run               BB2   BM25  DFR-BM25  IFB2  In-expB2  In-expC2  InL2  PL2   TF-IDF
Initial           0.93  0.95  0.95      0.95  0.95      0.95      0.95  0.95  0.95
Query Expansion   0.93  0.94  0.94      0.95  0.95      0.94      0.95  0.94  0.94
Term weighting 1  0.93  0.95  0.95      0.95  0.95      0.95      0.95  0.96  0.96
Term weighting 2  0.93  0.95  0.95      0.96  0.96      0.95      0.96  0.96  0.96
Combined          0.91  0.94  0.94      0.94  0.94      0.93      0.94  0.94  0.94

• Evaluation over 59 queries
• nDCG evaluation model (distinguishes highly relevant/relevant/not relevant)
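For readers unfamiliar with nDCG, a minimal sketch of one common formulation is given below. The slides do not state which gain values or discount function were used, so the gains (2 = highly relevant, 1 = relevant, 0 = not relevant) and the log2 discount are assumptions. The ideal ranking is also computed from the returned list only, which is a simplification.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG of the ranking divided by the DCG of the same gains in ideal order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Assumed graded judgements for one ranked result list.
perfect = [2, 1, 0]   # highly relevant first, then relevant, then not relevant
swapped = [2, 0, 1]   # a non-relevant result ranked above a relevant one

print(ndcg(perfect))
print(ndcg(swapped))
```

A perfectly ordered list scores 1.0, and any list that ranks a less relevant result above a more relevant one scores lower. Averaging such scores across a query set gives a per-run figure of the kind reported in the table.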

Vocabulary Explorer Screenshot

Finding the Right Term: Conclusions

• Vocabularies improve search
– Remove ambiguity
– Increase precision and recall
– Enable
  • Reasoning about relevance
  • Faceted browsing

• Provided tools for working with vocabularies
– Reliable search from keyword string to vocabulary term
– Exploration of vocabularies
– Mapping terms across vocabularies

Extracting relevant data

2. How do I query the relevant data sources?

A Data Integration Approach

• Heterogeneous sources
– Autonomous
– Local schemas

• Homogeneous view
– Mediated global schema

• Mapping
– LAV: local-as-view
– GAV: global-as-view

[Diagram labels: Query1 to Queryn; Global Schema; Mappings; Wrapper1/DB1, Wrapperi/DBi, Wrapperk/DBk]

Relies on agreement of a common global schema

P2P Data Integration Approach

• Heterogeneous sources
– Autonomous
– Local schemas

• Heterogeneous views
– Multiple schemas

• Mappings
– From sources to common schema
– Between pairs of schemas

• Require common integration data model

Can RDF do this?

[Diagram labels: Query1 to Queryn; Schema1, Schemaj; Mappings; Wrapper1/DB1, Wrapperi/DBi, Wrapperk/DBk]

Resource Description Framework

• W3C standard
• Designed as a metadata data model
• Contains semantic details
• Ideal for linking distributed data
• Queried through SPARQL

[Diagram: example RDF graph]
#Sun       #name     “The Sun”
#Sun       rdf:type  IAU:Star
#Sun       #foundIn  #MilkyWay
#MilkyWay  #name     “Milky Way”
#MilkyWay  #name     “The Galaxy”
#MilkyWay  rdf:type  IAU:BarredSpiral
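To make the graph model concrete, the example can be written as a list of (subject, predicate, object) triples and queried with a single triple pattern, which is the building block of a SPARQL query. The triples below are a reading of the slide's graph labels; the edge directions are an assumption, and `match` is a hypothetical helper, not part of any RDF library.

```python
# The example graph as triples (edge directions assumed from the labels).
TRIPLES = [
    ("#Sun", "#name", "The Sun"),
    ("#Sun", "rdf:type", "IAU:Star"),
    ("#Sun", "#foundIn", "#MilkyWay"),
    ("#MilkyWay", "#name", "Milky Way"),
    ("#MilkyWay", "#name", "The Galaxy"),
    ("#MilkyWay", "rdf:type", "IAU:BarredSpiral"),
]

def match(pattern, triples):
    """Yield variable bindings for one triple pattern.

    Pattern terms starting with '?' are variables; all others must match exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.get(term, value) != value:
                    break  # same variable already bound to a different value
                binding[term] = value
            elif term != value:
                break
        else:
            yield binding

# "What names does #MilkyWay have?"
names = sorted(b["?n"] for b in match(("#MilkyWay", "#name", "?n"), TRIPLES))
print(names)
```

Because every statement has the same three-part shape, data from different sources can be combined simply by concatenating triple lists, which is what makes RDF attractive as a common integration model.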

SPARQL

• Declarative query language
– Select returned data (projection)
  • Graph or tuples
  • Attributes to return
– Describe structure of desired results
– Filter data (selection)

• W3C standard
• Syntactically similar to SQL
– Should be easy for astronomers to learn!

Integrating Using RDF

• Data resources
– Expose schema and data as RDF
– Need a SPARQL endpoint

• Allows multiple
– Access models
– Storage models

• Easy to relate data from multiple sources

[Diagram labels: SPARQL query; Common Model (RDF); Mappings; RDF / Relational Conversion over Relational DB; RDF / XML Conversion over XML DB]

We will focus on exposing relational data sources

RDB2RDF: Two Approaches

Extract-Transform-Load
• Data replicated as RDF
– Data can become stale
• Native SPARQL query support
– Limited optimisation mechanisms

Existing RDF stores
• Jena
• Sesame

Query-driven Conversion
• Data stored as relations
• Native SQL query support
– Highly optimised access methods
• SPARQL queries must be translated

Existing translation systems
• D2RQ
• SquirrelRDF
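The Extract-Transform-Load route boils down to turning each relational row into a set of triples, one per column value. A minimal sketch is shown below; the URI scheme (`Table/key` for subjects, `Table#column` for predicates), the `objID` key column and the sample values are all assumptions made for illustration, not the scheme used by any of the tools named above.

```python
def rows_to_triples(table, key, rows):
    """Yield one (subject, predicate, object) triple per non-key column value."""
    for row in rows:
        subject = f"{table}/{row[key]}"
        for column, value in row.items():
            if column != key and value is not None:  # NULLs produce no triple
                yield (subject, f"{table}#{column}", value)

# Hypothetical rows; column names follow the ReliableStars query example.
rows = [
    {"objID": 1, "ra": 10.0, "dec": -5.0, "sCorMagB": 18.2},
    {"objID": 2, "ra": 11.5, "dec": -4.2, "sCorMagB": None},
]

triples = list(rows_to_triples("ReliableStars", "objID", rows))
for triple in triples:
    print(triple)
```

The output is a snapshot: any later change to the relational source is invisible until the conversion is re-run, which is the staleness problem noted for this approach. A relation with 152 attributes also expands into up to 151 triples per row.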

System Test Hypothesis

Is it viable to perform query-driven conversions to facilitate data access from a data model that an astronomer is familiar with?

Can RDB2RDF tools feasibly expose large science archives for data integration?

[Diagram labels: SPARQL query (twice); Common Model (RDF); Mappings; RDB2RDF over Relational DB; RDF / XML Conversion over XML DB]

Astronomical Test Data Set

• SuperCOSMOS Science Archive (SSA)
– Data extracted from scans of Schmidt plates
– Stored in a relational database
– About 4TB of data, detailing 6.4 billion objects
– Fairly typical of astronomical data archives

• Schema designed using 20 real queries
• Personal version contains
– Data for a specific region of the sky
– About 0.1% of the data
– About 500MB

Image: SuperCOSMOS Science Archive

Analysis of Test Data

• Using personal version
– About 500MB in size (similar size to related work)

• Organised in 14 Relations
– Number of attributes: 2 to 152
  • 4 relations with more than 20 attributes
– Number of rows: 3 to 585,560
– Two views
  • Complex selection criteria in views

Makes this different from business cases and previous work!

Is SPARQL expressive enough?

Can the 20 sample queries be expressed in SPARQL?

Real Science Queries

Query 5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5

SQL:
SELECT ra, dec, sCorMagB, sCorMagR2, sCorMagI
FROM ReliableStars
WHERE (sCorMagB-sCorMagR2 BETWEEN 0.05 AND 0.80)
  AND (sCorMagR2-sCorMagI BETWEEN -0.17 AND 0.64)

SPARQL:
SELECT ?ra ?decl ?sCorMagB ?sCorMagR2 ?sCorMagI
WHERE {
  …<bindings>…
  FILTER (?sCorMagB - ?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80)
  FILTER (?sCorMagR2 - ?sCorMagI >= -0.17 && ?sCorMagR2 - ?sCorMagI <= 0.64)
}
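The SPARQL version of Query 5 shows the workaround for the missing range shorthand: each SQL BETWEEN becomes a pair of comparisons inside a FILTER. That rewrite is mechanical, as the sketch below illustrates. `between_to_filter` is a hypothetical helper written for this note, not part of any SPARQL toolkit, and it performs plain string formatting with no validation of the expression.

```python
def between_to_filter(expr: str, low: float, high: float) -> str:
    """Rewrite `expr BETWEEN low AND high` as an equivalent SPARQL FILTER."""
    return f"FILTER ({expr} >= {low} && {expr} <= {high})"

# The two range conditions from Query 5.
filters = [
    between_to_filter("?sCorMagB - ?sCorMagR2", 0.05, 0.80),
    between_to_filter("?sCorMagR2 - ?sCorMagI", -0.17, 0.64),
]

for clause in filters:
    print(clause)
```

Note that the arithmetic expression appears twice in each generated FILTER, so a translator that maps the SPARQL back to SQL has to recognise the pattern if it wants to recover the original BETWEEN.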

Analysis of Test Queries

Query Feature                               Query Numbers
Arithmetic in body                          1-5, 7, 9, 12, 13, 15-20
Arithmetic in head                          7-9, 12, 13
Ordering                                    1-8, 10-17, 19, 20
Joins (including self-joins)                12-17, 19
Range functions (e.g. Between, ABS)         2, 3, 5, 8, 12, 13, 15, 17-20
Aggregate functions (including Group By)    7-9, 18
Math functions (e.g. power, log, root)      4, 9, 16
Trigonometry functions                      8, 12
Negated sub-query                           18, 20
Type casting (e.g. radians to degrees)      7, 8, 12
Server defined functions                    10, 11

Expressivity of SPARQL

Features
• Select-project-join
• Arithmetic in body
• Conjunction and disjunction
• Ordering
• String matching
• External function calls (extension mechanism)

Limitations
• Range shorthands
• Arithmetic in head
• Math functions
• Trigonometry functions
• Sub queries
• Aggregate functions
• Casting

Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19

Can RDB2RDF tools feasibly expose large science archives for data integration?

Experiment

• Time query evaluation
– 5 out of 20 queries used
– No joins

• Systems compared:
– Relational DB (baseline)
  • MySQL v5.1.25
– RDB2RDF tools
  • D2RQ v0.5.2
  • SquirrelRDF v0.1
– RDF Triple stores
  • Jena v2.5.6 (SDB)
  • Sesame v2.1.3 (Native)

[Diagram labels: SQL query over Relational DB; SPARQL query over RDB2RDF over Relational DB; SPARQL query over Triple store]

Experimental Configuration

• 8 identical machines
– 64 bit Intel Quad Core Xeon 2.4GHz
– 4GB RAM
– 100 GB Hard drive
– Java 1.6
– Linux

• 10 repetitions

Performance Results

[Bar chart: query evaluation time in ms (axis 0 to 1000) for Queries 1, 2, 3, 5 and 6 on MySQL, D2RQ, SquirrelRDF (SqRDF), Jena and Sesame. Value labels printed on the chart: 3,450; 5,339; 21,492; 485,932; 2,733; 7,229; 4,090; 1,307; 17,793; 7,468; 19,984; 372,561 (assignment of labels to bars not recoverable).]

The Show Stopper: Query Translation

• Each bound variable resulted in a self-join
– RDBMS cannot optimize for this
– RDBMS perform badly with self-joins

• Each row retrieved with a separate query
– 1 query becomes n queries, where n is cardinality of relation

• Predicate selection in RDB2RDF tool
– No optimization possible
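The first two translation problems can be reproduced in miniature with an in-memory database. The sketch below is only an illustration of the patterns described: it uses SQLite as a stand-in for MySQL, a three-row toy table with an assumed `objID` key, and hand-written SQL. It is not the SQL that D2RQ or SquirrelRDF actually generated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ReliableStars "
    "(objID INTEGER PRIMARY KEY, ra REAL, dec REAL, sCorMagB REAL)"
)
conn.executemany(
    "INSERT INTO ReliableStars VALUES (?, ?, ?, ?)",
    [(1, 10.0, -5.0, 18.2), (2, 11.5, -4.2, 19.7), (3, 12.1, -3.9, 17.4)],
)

# What a relational user would write: one scan of the table.
DIRECT = "SELECT ra, dec, sCorMagB FROM ReliableStars ORDER BY objID"

# Naive translation: one alias of the table per bound variable,
# joined back together on the key (two self-joins for three variables).
SELF_JOIN = """
SELECT t1.ra, t2.dec, t3.sCorMagB
FROM ReliableStars AS t1
JOIN ReliableStars AS t2 ON t2.objID = t1.objID
JOIN ReliableStars AS t3 ON t3.objID = t1.objID
ORDER BY t1.objID
"""

direct_rows = conn.execute(DIRECT).fetchall()
joined_rows = conn.execute(SELF_JOIN).fetchall()

# Row-at-a-time retrieval: one query for the keys, then one query per row.
keys = [k for (k,) in conn.execute("SELECT objID FROM ReliableStars ORDER BY objID")]
per_row = [
    conn.execute(
        "SELECT ra, dec, sCorMagB FROM ReliableStars WHERE objID = ?", (k,)
    ).fetchone()
    for k in keys
]

print(direct_rows == joined_rows == per_row, len(keys) + 1, "queries issued")
```

All three strategies return the same rows, so the cost is purely in how the work is done: extra joins in one case, and n + 1 round trips in the other. On a relation with hundreds of thousands of rows, that difference is what separates milliseconds from minutes.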

Extracting Relevant Data: Conclusions

• SPARQL not expressive enough for real (astronomy) queries

• RDBMS benefits from 30+ years research– Query optimisation– Indexes

• RDF stores are improving– Require existing data to be replicated

• RDB2RDF tools show promise– Need to exploit relational database

Can RDB2RDF Tools Feasibly Expose Large Science Archives for Data Integration?

Not currently!

More work needed on query translation…

Conclusions & Future Work

Traditional Integration Challenges

1. Locating data

2. Extracting relevant data

3. Understanding data

Semantic Web Solution

• SKOS Vocabularies
– Removes ambiguity
– Enables limited machine understanding

• RDB2RDF Tools
– Requires improved query translation

• Semantic model mappings
– Follow “chains” of mappings
– Relies on RDB2RDF work