sem tech 2011 v8


Upload: dallemang

Post on 08-Jun-2015



TRANSCRIPT

Page 1: Sem tech 2011 v8

Nele, living with lupus

Building a semantic integration framework to support a federated query environment in 5 steps

Philip Ashworth UCB Celltech

Dean Allemang TopQuadrant

Page 2: Sem tech 2011 v8

Data Integration… Why?

Scope and knowledge of life sciences expands every day

Every day we make new discoveries by experimenting (in the lab)

Data is generated in the lab in large quantities, complementing the vast growth of external data

It is too difficult and time-consuming for users to bring this data together

Therefore we don’t often make use of the data we already have to make new discoveries

Page 3: Sem tech 2011 v8

Data Integration… Problems

[Diagram. Boxes: Warehouse DB, Project DB, Project Marts, Applications, App DBs. Connection labels: Registration, Query, DI.]

Page 4: Sem tech 2011 v8

Data Integration… Problems

Demand for DI increases every day.

Data doesn’t evolve into a larger, more beneficial platform

• Where is the long-term benefit?

• Driving ourselves around in circles

Just creating more data silos

• Limited scope for reuse

Slow & difficult to modify / enhance

High maintenance

• Multiple systems create more and more overhead

Page 5: Sem tech 2011 v8

Data Integration… Thoughts

Data Integration is clearly evolving

But it is not fulfilling the needs

If we identify the need… can we see what we should be doing?

Page 6: Sem tech 2011 v8

Data Integration… Needs

Accessible Data

True Integration

Variety of Sources

Align Concepts

Data has Context

All Data for All Projects

Page 7: Sem tech 2011 v8

Data Integration… There is a way!

Open Linked Data Cloud

Connected and linked data with context

Created by a community

Significant linking hubs appearing

Significant scientific content

A valuable resource that will only grow!

Something we can learn from!

Page 8: Sem tech 2011 v8

Data Integration… Starting an Evolutionary Leap

No one internally really knows about this

Can’t just rip and replace old systems

Have to do some ground work

Page 9: Sem tech 2011 v8

Linked Data…The Quest

Technology Projects

• Emphasis on semantic web principles

Scientific Projects

• Data Integration

• Data Visualisation (mash-ups)

Page 10: Sem tech 2011 v8

Linked Data… The Quest

[Diagram labels: “Highly Repetitive & Promiscuous”; “Highly Promiscuous & Repetitive”]

Page 11: Sem tech 2011 v8

New Approach

Develop a POC semantic data integration framework

• Easy to configure

• Support all projects

• Builds an environment for the future.

[Diagram label: Linked Data]

Page 12: Sem tech 2011 v8

The Idea

[Architecture diagram. Layers shown:]

• Applications; Business Process / Workflow Automation

• Rest Services (Abstraction layer)

• Semantic Integration Framework: Knowledge Collation, Concept mapping, Distributed Query, Result inference, Aggregation

• RDF via SPARQL Endpoints

• Data Sources: RDBMS (Oracle, Postgres, SQL, MySQL); RDF Triple Store; MS Excel, TXT, Doc; native SPARQL endpoint

[Side labels: PURL; Increasing Ease of Development; Decreasing knowledge of Semantic technologies]

Page 13: Sem tech 2011 v8

Step 1. Data Sources

Expose data as RDF through SPARQL Endpoints

Internal Data sources

• D2R SPARQL Endpoints on RDBMS databases

• Each modelled as the local concepts that they represent

• Don’t worry about the larger concept picture

• Virtuoso RDF triple store (open source) to host RDF data created from spreadsheets

• TopBraid Ensemble & SPARQLMotion/SPIN scripts to convert static data to RDF

[Diagram labels: RDBMS, D2R, Virtuoso, RDF, SPARQL Endpoints]
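The spreadsheet-to-RDF step can be pictured with a small sketch. It is only an illustration: the slides used TopBraid and SPARQLMotion/SPIN for this, not Python, and the namespace `http://example.org/ucb/`, the column names and the sample rows are all hypothetical.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/ucb/"  # hypothetical namespace, not from the slides
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def escape_literal(value: str) -> str:
    """Escape a string so it can be written as an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
            .replace("\n", "\\n").replace("\r", "\\r"))


def rows_to_ntriples(csv_text: str, concept: str, key_column: str) -> list[str]:
    """Emit one rdf:type triple per row plus one triple per non-empty cell."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{concept}/{quote(row[key_column], safe='')}>"
        triples.append(f"{subject} {RDF_TYPE} <{BASE}{concept}> .")
        for column, value in row.items():
            if column == key_column or not value:
                continue  # the key is already in the subject URI; skip empty cells
            predicate = f"<{BASE}{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Made-up sample data standing in for a spreadsheet export.
SAMPLE = "id,name,target\nAB1,Antibody one,IL6\nAB2,Antibody two,\n"
for line in rows_to_ntriples(SAMPLE, "Antibody", "id"):
    print(line)
```

The resulting N-Triples lines could then be loaded into a triple store such as Virtuoso, which is the role it plays on this slide.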

Page 14: Sem tech 2011 v8

Step 1. Data Sources

External Data Sources

• SPARQL endpoints in LOD from Bio2RDF, LODD and others.

• Some stability, access and quality issues within these sources.

• Created Amazon Cloud server to host stable environments.

• Bio2RDF sources downloaded, stored and modified

• Virtuoso (open source) used as triple store

[Diagram: UCB Data Cloud and Linked Open Data Cloud, exposed as RDF. Node labels: IDAC, MOC, PEP, Abysis, NBEMart, SEQ, Bio2RDF PDB, NBEWH, ITrack, PMT, LDAP, WKW, UCBPDB, Premier, Sider, Kegg cpd, Diseasome, Kegg gl, Kegg dr, chebi, Uniprot ec, geneid]
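Every source on this slide is reached through a SPARQL endpoint, so a client only needs the standard SPARQL Protocol and the standard JSON results format. The sketch below shows both halves without touching the network. The endpoint URL, the query and the canned response are hypothetical placeholders, not UCB or Bio2RDF values.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL


def build_query_url(endpoint: str, sparql: str) -> str:
    """Build a SPARQL Protocol GET URL; the query text goes in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": sparql})


def parse_bindings(results_json: str) -> list[dict[str, str]]:
    """Flatten a SPARQL JSON results document into {variable: value} rows."""
    doc = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


QUERY = "SELECT ?s ?label WHERE { ?s ?p ?label } LIMIT 1"
url = build_query_url(ENDPOINT, QUERY)

# Canned response in the SPARQL JSON results layout (made-up content).
CANNED = json.dumps({
    "head": {"vars": ["s", "label"]},
    "results": {"bindings": [
        {"s": {"type": "uri", "value": "http://example.org/resource/1"},
         "label": {"type": "literal", "value": "example label"}},
    ]},
})
rows = parse_bindings(CANNED)
print(url)
print(rows)
```

In a live setting the URL would be fetched with an `Accept: application/sparql-results+json` header; hosting copies of the external sources, as the slide describes, keeps such calls independent of the public endpoints' availability.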

Page 15: Sem tech 2011 v8

Step 2: Integration Framework

Why?

• Linked Open Data: links within a source are manually created

• To navigate the cloud you either

• Learn the network

• Discover the network as you go through (unguided)

• There is nothing that understands the total connectivity of concepts available to you.

• Difficult to know where to start

• No idea if a start point will lead you to the information you are looking for or might be interested in.

• Can’t query the cloud for specific information

The Integration Framework will resolve these issues

• It will model the models to understand the connectivity

You shouldn’t have to know where to look for data

Page 16: Sem tech 2011 v8

Step 2: Integration Framework

[Architecture diagram. Layers: Applications; Business Process / Workflow Automation; Rest Services (Abstraction layer); Semantic Integration Framework (Knowledge Collation, Concept mapping, Distributed Query, Result inference, Aggregation); RDF; Data Sources; PURL. Callouts:]

• Understand Data Sources (concepts, access, props)

• Understand Links Across Sources

• Automate some tasks

• Accessible Via Services

• Easy to wire up

• Understand UCB concepts

• Understand how UCB Concepts fit with source concepts

Page 17: Sem tech 2011 v8

Step 2: Integration Framework.

Integration Framework

• Data source, concept and property registry

• An Ontology that utilises

• VoID (enhanced) to capture data source information (endpoints)

• SKOS to link local ontologies with UCB concepts

• UCB:Person -> db1:user, db2:employee, db3:actor

Built using TopBraid Suite

• Ontology development (TopBraid Composer)

• SPARQLMotion scripts to provide some automation

• Creation of ontologies from endpoints, D2R mappings

• Configuration assistance

[Diagram label: Sem Int Framework]
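The registry idea on this slide can be sketched as a toy data structure. This is a stand-in, not the VoID/SKOS ontology itself: the class, its method names and the endpoint URLs are hypothetical, and only the `UCB:Person -> db1:user, db2:employee, db3:actor` mapping comes from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Toy stand-in for the data source and concept registry."""
    # dataset prefix -> SPARQL endpoint (the role VoID plays on the slide)
    endpoints: dict[str, str] = field(default_factory=dict)
    # UCB concept -> local concepts (the role SKOS mappings play on the slide)
    narrow_matches: dict[str, list[str]] = field(default_factory=dict)

    def register_dataset(self, prefix: str, endpoint: str) -> None:
        self.endpoints[prefix] = endpoint

    def map_concept(self, ucb_concept: str, local_concept: str) -> None:
        self.narrow_matches.setdefault(ucb_concept, []).append(local_concept)

    def local_concepts(self, ucb_concept: str) -> list[str]:
        return list(self.narrow_matches.get(ucb_concept, []))

    def endpoint_for(self, local_concept: str) -> str:
        prefix = local_concept.split(":", 1)[0]
        return self.endpoints[prefix]


registry = Registry()
for prefix in ("db1", "db2", "db3"):
    # Hypothetical endpoint URLs.
    registry.register_dataset(prefix, f"http://example.org/{prefix}/sparql")
for local in ("db1:user", "db2:employee", "db3:actor"):
    registry.map_concept("UCB:Person", local)

for local in registry.local_concepts("UCB:Person"):
    print(local, "->", registry.endpoint_for(local))
```

Adding a source then amounts to one `register_dataset` call plus a few `map_concept` calls, which mirrors the slide's point that new sources are described rather than programmed.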

Page 18: Sem tech 2011 v8

Step 2: Integration Framework.

[Diagram: DB1 is described in the Dataset Ontology (VoID); the UCB Concept Ontology (SKOS) links UCB concepts to DB1 concepts. Mappings shown:]

• UCB:Antibody narrowMatch DB1:Antibody

• UCB:Person narrowMatch DB1:User

• UCB:Project narrowMatch DB1:Project

[Diagram labels: Integration Framework; Sem Int Framework]

Page 19: Sem tech 2011 v8

Step 2: Integration Framework.

[Diagram: DB1, DB2 and DB3 are described in the Dataset Ontology (VoID); the UCB Concept Ontology (SKOS) links one UCB concept to concepts in all three. Mappings shown:]

• UCB:Person narrowMatch DB1:User

• UCB:Person narrowMatch DB2:Person

• UCB:Person narrowMatch DB3:Employee

• UCB:Person narrowMatch DB3:Contact

[Diagram label: Sem Int Framework]

Page 20: Sem tech 2011 v8

Step 2: Integration Framework.

[Diagram: DB1, DB2 and DB3 in the Dataset Ontology (VoID), mapped through the UCB Concept Ontology (SKOS). Mappings shown:]

• UCB:Person narrowMatch DB1:User

• UCB:Person narrowMatch DB2:Person

• UCB:Person narrowMatch DB3:Employee

• UCB:Person narrowMatch DB3:Contact

Linksets

• Person_DB1_DB2

• Person_DB1_DB3

[Diagram label: Sem Int Framework]
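Linksets are what let the framework know how datasets connect. A minimal sketch, assuming each linkset simply joins two datasets (the two linkset names come from the slide; treating them as undirected edges is an assumption), finds the chain of linksets between any two sources with a breadth-first search:

```python
from collections import deque

# Linkset name -> the two datasets it joins (names taken from the slide).
LINKSETS = {
    "Person_DB1_DB2": ("DB1", "DB2"),
    "Person_DB1_DB3": ("DB1", "DB3"),
}


def link_path(start, goal, linksets):
    """Breadth-first search for the shortest chain of linksets joining two datasets."""
    neighbours = {}
    for name, (a, b) in linksets.items():
        neighbours.setdefault(a, []).append((b, name))
        neighbours.setdefault(b, []).append((a, name))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        dataset, path = queue.popleft()
        if dataset == goal:
            return path
        for nxt, name in neighbours.get(dataset, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None  # the two datasets are not connected


print(link_path("DB2", "DB3", LINKSETS))
```

Here DB2 and DB3 have no direct linkset, yet they are connected through DB1, which is the kind of connectivity a user would otherwise have to discover by hand.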

Page 21: Sem tech 2011 v8

Step 2: Integration Framework.

[Diagram: Dataset Ontology (VoID) and UCB Concept Ontology (SKOS) connecting twelve numbered nodes (1 to 12). Label: Sem Int Framework]

Page 22: Sem tech 2011 v8

Step 3: Rest Services

Rest Services

• Interaction point for applications

• Expose simple and generic access to the Integration framework

• Removes complexity of framework and how to ask questions of it.

• You don’t need to know how to make it work

• You don’t need to know anything about the datasets and the concepts and properties held within.

• Just ask simple questions in the UCB language

• Tell me about UCB:Person “ashworth”

• Built using SPARQLMotion/SPIN and exposed in TopBraid Live enterprise server.

• Two simple yet very effective services created

Page 23: Sem tech 2011 v8

Step 3: Rest Services

[Diagram: the Keyword Search and Get Info services sit over the Dataset Ontology (VoID), the UCB Concept Ontology (SKOS) and DB1, DB2, DB3. Messages shown:]

• Find UCB:Person “phil”

• Tell me the sub-types of UCB:Person

• Tell me the datasets for the sub-types

• Search DB1:User

• Search DB2:Person

• Search DB3:Employee

• Search DB3:Contact

• Can the linksets tell us any info?

• Here are the resources for “phil”: ldap:U0xx10x, itrack:101, moc:scordisp etc….
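The keyword search flow on this slide (look up the sub-types, find their datasets, search each one) can be sketched as a query planner. The narrowMatch table is taken from the earlier slides; the namespaces and the exact SPARQL text are assumptions made for illustration, since the slides do not show the queries the framework actually generates.

```python
# UCB concept -> local sub-types (from the earlier slides).
NARROW_MATCH = {
    "UCB:Person": ["DB1:User", "DB2:Person", "DB3:Employee", "DB3:Contact"],
}
# Hypothetical namespaces for the three datasets.
NS = {
    "DB1": "http://example.org/DB1/",
    "DB2": "http://example.org/DB2/",
    "DB3": "http://example.org/DB3/",
}


def plan_keyword_search(ucb_concept, keyword):
    """Return (dataset, sparql) pairs, one per local sub-type of the UCB concept."""
    # Escape the keyword so it is safe inside a SPARQL string literal.
    safe = keyword.replace("\\", "\\\\").replace('"', '\\"')
    plans = []
    for local in NARROW_MATCH.get(ucb_concept, []):
        dataset, local_name = local.split(":", 1)
        sparql = (
            f"SELECT DISTINCT ?resource WHERE {{ "
            f"?resource a <{NS[dataset]}{local_name}> ; ?p ?value . "
            f'FILTER(isLiteral(?value) && CONTAINS(LCASE(STR(?value)), LCASE("{safe}"))) }}'
        )
        plans.append((dataset, sparql))
    return plans


for dataset, sparql in plan_keyword_search("UCB:Person", "phil"):
    print(dataset, "::", sparql)
```

The caller only supplies a UCB concept and a keyword; which datasets and local classes get searched is decided by the mappings, which is the point of the abstraction layer.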

Page 24: Sem tech 2011 v8

Step 3: Rest Services

[Diagram: the Keyword Search and Get Info services sit over the Dataset Ontology (VoID), the UCB Concept Ontology (SKOS) and DB1, DB2, DB3. Messages shown:]

• Tell me about moc:scordisp

• Tell me everything about this resource?

• Tell me the super-types of all resources

• Retrieve DB1:U0xx10x

• Retrieve DB2:scordisp

• Retrieve DB3:philscordis

• Here is everything I know about it.
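The “Get Info” behaviour on this slide, collecting what every source knows about one resource, can be sketched as follows. The three resource identifiers appear on the slide; the link pairs between them, the property names and the values are hypothetical stand-ins for what linksets and live endpoints would supply.

```python
# Pairs of identifiers assumed to denote the same entity (hypothetical link data).
LINKS = [
    ("DB1:U0xx10x", "DB2:scordisp"),
    ("DB2:scordisp", "DB3:philscordis"),
]

# Made-up per-source descriptions standing in for live endpoint responses.
SOURCE_DATA = {
    "DB1:U0xx10x": {"account": "value from DB1"},
    "DB2:scordisp": {"project": "value from DB2"},
    "DB3:philscordis": {"contact": "value from DB3"},
}


def equivalents(resource, links):
    """Collect every identifier connected to `resource` through the link pairs."""
    found = {resource}
    changed = True
    while changed:
        changed = False
        for a, b in links:
            if (a in found) != (b in found):  # exactly one end is known so far
                found |= {a, b}
                changed = True
    return found


def get_info(resource, links, fetch):
    """Gather one description per equivalent identifier, keyed by source identifier."""
    return {r: fetch(r) for r in sorted(equivalents(resource, links))}


info = get_info("DB2:scordisp", LINKS, lambda r: SOURCE_DATA.get(r, {}))
for source, description in info.items():
    print(source, description)
```

Keeping the result keyed by source identifier preserves provenance, which matches the later application slides where data is displayed per source.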

Page 25: Sem tech 2011 v8

Step 4: Building an Application 1

Data Exploration environment

• Search concepts

• Display data

• Allow link following.

• Deals with any concept defined in UCB SKOS language

• Uses two framework services mentioned previously.

• Deployed in TopBraid Ensemble – Live

Page 26: Sem tech 2011 v8

Step 4: Data Exploration

UCB Concepts

Search submitted to “Keyword Search” Service

Applications

Page 27: Sem tech 2011 v8

Step 4: Data Exploration

Results Displayed.

Index shows inference is already taking place

Applications

Page 28: Sem tech 2011 v8

Step 4: Data Exploration

Drag Instance to basket, initiates “Get Info” Service call

Applications

Page 29: Sem tech 2011 v8

Step 4: Data Exploration

Select Instance

Data Displayed per Source

Applications

Page 30: Sem tech 2011 v8

Step 4: Data Exploration

Links to other data items

Applications

Page 31: Sem tech 2011 v8

Step 4: Data Exploration

Displays sparse data

Submit Instance to “Get Info” service

Applications

Page 32: Sem tech 2011 v8

Step 4: Data Exploration

More Detailed Information

Applications

Page 33: Sem tech 2011 v8

Step 4: Data Exploration

He has another interaction.

Let’s explore.

Applications

Page 34: Sem tech 2011 v8

Step 4: Data Exploration Applications

Page 35: Sem tech 2011 v8

Step 4: Data Exploration Applications

Data cached as we navigated Concept Explorer. Can now be investigated.

Page 36: Sem tech 2011 v8

Step 4: Data Exploration

Structure concept: Keyword Search pulls data from internal and external data sources

Add to basket

After detailed information is retrieved, a second Structure has been identified without a keyword search

Integrated Internal and External data

Applications

Page 37: Sem tech 2011 v8

Step 4: Data Exploration Applications

Page 38: Sem tech 2011 v8

Step 4: Building an Application 2

Federated data gathering & marting

• Data marting without the warehouse

• New Mart Rest Service

• SPARQLMotion/SPIN scripts

• Dump_UCB:Antibody

• Still uses framework to integrate data

• On the fly data integration

• Gather RDF from data sources

• Dump into tables

• Data consumed by traditional query tools

• Not particularly designed for this aspect… (slow)

• But works!
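The “gather RDF, dump into tables” step can be sketched as a pivot from triples to rows. The real mart service was built with SPARQLMotion/SPIN; this Python version, its function names, the `|` separator for multi-valued properties and the sample triples are all assumptions made for illustration.

```python
import csv
import io


def triples_to_rows(triples):
    """Pivot (subject, predicate, object) triples into one row per subject."""
    rows = {}
    columns = []
    for s, p, o in triples:
        row = rows.setdefault(s, {"subject": s})
        if p not in columns:
            columns.append(p)
        # Multi-valued properties are joined with "|" so each subject stays on one row.
        row[p] = f"{row[p]}|{o}" if p in row else o
    return ["subject"] + columns, list(rows.values())


def rows_to_csv(header, rows):
    """Serialise the pivoted rows as CSV text, leaving missing cells empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up triples standing in for RDF gathered from several sources.
TRIPLES = [
    ("ab:1", "name", "Antibody one"),
    ("ab:1", "target", "IL6"),
    ("ab:2", "name", "Antibody two"),
]
header, rows = triples_to_rows(TRIPLES)
print(rows_to_csv(header, rows))
```

A flat table like this is what traditional query tools expect, at the cost of losing the graph structure, which fits the slide's remark that the framework was not particularly designed for this use.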

Page 39: Sem tech 2011 v8

Step 4: Building an Application 3

Knowledge Base Creation

• Gathering information can be a time-consuming exercise

• But is vital for projects to have

• Different individuals have different ideas

• Relevance, sources, presentation, etc.

• Knowledge Base to provide consistency for

• Data gathered

• Data sources used

• Data presentation

• ROI

• 150-fold increase in efficiency

• 6 mins compared to > 16 hrs (over several weeks)

• Information available to all at central access point

Page 40: Sem tech 2011 v8

Step 4: Knowledge Base

Semantic Integration Framework

Keyword Search Get Info

Data Sources

App Service

“Tell me about the protein with Gene ID X” and I want to know about Literature Refs, Sequences, Descriptions, Structure…… etc.

Applications

Page 41: Sem tech 2011 v8

Step 4: Knowledge Base Applications


Page 46: Sem tech 2011 v8

Step 5: Purl Server

Removing URL dependencies

D2R publishes resolvable URLs that are specific to the server

Removing URL specificity with PURL server

Allows each layer of the architecture to be removed without all the others having to be reconfigured

• Level of independence / indirection

Only done on limited scale

[Diagram label: PURL]
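The indirection a PURL server provides can be sketched as a prefix rewrite table. A real PURL server answers with HTTP redirects; this sketch only models the mapping, and every URL in it is hypothetical.

```python
# Persistent prefix -> current server-specific prefix (all URLs hypothetical).
PURL_TABLE = {
    "http://purl.example.org/db1/": "http://d2r-a.example.org:2020/resource/",
}


def resolve(purl, table):
    """Rewrite a persistent URL to its current location; the longest prefix wins."""
    for prefix in sorted(table, key=len, reverse=True):
        if purl.startswith(prefix):
            return table[prefix] + purl[len(prefix):]
    raise KeyError(f"no PURL mapping for {purl}")


persistent = "http://purl.example.org/db1/user/42"
before = resolve(persistent, PURL_TABLE)

# Moving the D2R server only changes the table, not the identifier others hold.
PURL_TABLE["http://purl.example.org/db1/"] = "http://d2r-b.example.org/resource/"
after = resolve(persistent, PURL_TABLE)

print(before)
print(after)
```

Because every other layer stores only the persistent form, a layer can be moved or replaced by editing one table entry, which is the independence the slide describes.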

Page 47: Sem tech 2011 v8

Conclusions & Business value

We have built an extensible data integration framework

• Shown how data integration can be an incremental process

• Started with three datasets, more than 20 a few months later

• Compare: the warehouse took 18 months to add two new data sources

• Adding a new source can take less than a day (whole process, inc. endpoint creation)

• Creates an enterprise-wide “data fabric” rather than just one more application

• Connect datasets together like web pages fit together

• Literally click from one dataset to the other

• Dynamically mash-up data from multiple sources

• Add new sources by describing the connections, not by building a new application

Page 48: Sem tech 2011 v8

Conclusions & Business value

We have built a framework that

• Differs from data integration applications the way the Web differs from earlier network technologies (ftp, archie)

• Infrastructure allows new entities (pages, databases) to be added dynamically

• Adding connections is as easy as specifying them

• Provides data for all projects

• Three very different applications have been demonstrated

• All are able to use the same framework

• Reuse

Page 49: Sem tech 2011 v8


Questions?