SemTech 2011
DESCRIPTION
Phil Ashworth and Dean Allemang, "Building a semantic integration framework to support a federated query environment in 5 steps"
TRANSCRIPT
Building a semantic integration framework to support a federated query environment in 5 steps
Philip Ashworth UCB Celltech
Dean Allemang TopQuadrant
Data Integration… Why?
The scope and knowledge of life sciences expands every day
Every day we make new discoveries by experimenting (in the lab)
Data is generated in the lab in large quantities, complementing the vast growth externally
It is too difficult and time-consuming for the user to bring data together
Therefore we don't often make use of the data we already have to make new discoveries
Data Integration… Problems
[Diagram: the evolution of data-integration approaches — warehouse DB, project DBs, project marts, applications and app DBs, each with its own registration, DI and query layer]
Data Integration… Problems
Demand for DI increases every day
Data doesn't evolve into a larger, more beneficial platform
• Where is the long-term benefit?
• Driving ourselves around in circles
Just creating more data silos
• Limited scope for reuse
Slow and difficult to modify / enhance
High maintenance
• Multiple systems create more and more overhead
Data Integration… Thoughts
Data Integration is clearly evolving
But it is not fulfilling the needs
If we identify the need… can we see what we should be doing?
Accessible Data
True Integration
Variety of Sources
Align Concepts
Data has context
All Data for All Projects
Data Integration… Needs
Open Linked Data Cloud
Connected and linked data with context
Created by a community
Significant linking hubs appearing
Significant scientific content
A valuable resource that will only grow!
Something we can learn from!
Data Integration… There is a way!
Data Integration… Starting an Evolutionary Leap
No one internally really knows about this
Can’t just rip and replace old systems
Have to do some ground work
Linked Data…The Quest
Technology Projects
• Emphasis on semantic web principles
Scientific Projects
• Data Integration
• Data Visualisation (mash-ups)
Linked Data… The Quest
Highly repetitive & promiscuous
New Approach
Develop a POC semantic data integration framework
• Easy to configure
• Support all projects
• Builds an environment for the future
The Idea
[Architecture diagram, bottom to top: Data Sources — RDBMS (Oracle, PostgreSQL, MySQL) exposed as SPARQL endpoints, an RDF triple store with a native SPARQL endpoint, and static files (MS Excel, TXT, DOC) converted to RDF; the Semantic Integration Framework, providing knowledge collation, concept mapping, distributed query, result inference and aggregation; REST services as an abstraction layer; Applications and business process / workflow automation on top, with a PURL server alongside. Moving up the stack, ease of development increases while the knowledge of semantic technologies required decreases.]
Step 1. Data Sources
Expose data as RDF through SPARQL endpoints
Internal data sources
• D2R SPARQL endpoints on RDBMS databases
• Each modelled as the local concepts they represent
• Don't worry about the larger concept picture
• Virtuoso RDF triple store (open source) to host RDF data created from spreadsheets
• TopBraid Ensemble & SPARQLMotion/SPIN scripts to convert static data to RDF
[Diagram: RDBMS sources exposed through D2R SPARQL endpoints; spreadsheet-derived RDF hosted in Virtuoso]
External data sources
• SPARQL endpoints in the LOD cloud from Bio2RDF, LODD and others
• Some stability, access and quality issues within these sources
• Created an Amazon cloud server to host stable environments
• Bio2RDF sources downloaded, stored and modified
• Virtuoso (open source) used as triple store
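Step 1 boils down to turning whatever a source holds into triples. As a minimal sketch of the spreadsheet-to-RDF part (the deck used TopBraid SPARQLMotion/SPIN scripts for this; the namespace and column names below are made up for illustration), each row becomes one subject with one triple per cell:

```python
import csv, io

# Hypothetical internal namespace -- not the real UCB URI scheme.
BASE = "http://example.ucb.internal/resource/"

def rows_to_ntriples(csv_text, subject_col, type_uri):
    """Turn each spreadsheet row into N-Triples lines: one rdf:type
    triple per row plus one literal triple per remaining column."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subj = "<%s%s>" % (BASE, row[subject_col])
        triples.append(
            "%s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <%s> ."
            % (subj, type_uri))
        for col, val in row.items():
            if col != subject_col and val:
                triples.append('%s <%sprop/%s> "%s" .' % (subj, BASE, col, val))
    return triples

# Illustrative data, not real UCB content.
sheet = "id,name,target\nAB001,mAb-1,TNF\nAB002,mAb-2,IL-6\n"
nt = rows_to_ntriples(sheet, "id", BASE + "Antibody")
```

The resulting N-Triples file can be bulk-loaded into Virtuoso, after which the data is queryable alongside the D2R-wrapped databases.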
Step 1. Data Sources
[Diagram: the UCB Data Cloud (IDAC, MOC, PEP, Abysis, NBEMart, SEQ, NBEWH, ITrack, PMT, LDAP, WKW, UCBPDB, Premier) linked as RDF to the Linked Open Data Cloud (Bio2RDF PDB, Sider, Diseasome, KEGG compound / glycan / drug, ChEBI, UniProt EC, GeneID)]
Step 2: Integration Framework
Why?
• Linked Open Data: links within a source are manually created
• To navigate the cloud you either:
• Learn the network
• Discover the network as you go (unguided)
• Nothing understands the total connectivity of the concepts available to you
• Difficult to know where to start
• No idea whether a start point will lead you to the information you are looking for or might be interested in
• Can't query the cloud for specific information
The Integration Framework will resolve these issues• It will model the models to understand the connectivity
You shouldn’t have to know where to look for data
[The architecture diagram again, highlighting the Semantic Integration Framework layer between the data sources and the REST services]
Understand data sources (concepts, access, properties)
Understand links across sources
Automate some tasks
Accessible via services
Easy to wire up
Understand UCB concepts
Understand how UCB concepts fit with source concepts
Step 2: Integration Framework
Step 2: Integration Framework.
Integration Framework
• Data source, concept and property registry
• An ontology that utilises:
• VoID (enhanced) to capture data source information (endpoints)
• SKOS to link local ontologies with UCB concepts
• UCB:Person -> db1:user, db2:employee, db3:actor
Built using the TopBraid Suite
• Ontology development (TopBraid Composer)
• SPARQLMotion scripts to provide some automation
• Creation of ontologies from endpoints, D2R mappings
• Configuration assistance
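The registry can be pictured as two small lookup structures: the SKOS mappings from UCB concepts to source-local classes, and the VoID-style record of which endpoint serves each source. A minimal sketch (all names and endpoint URLs below are illustrative; the real registry is an ontology maintained in TopBraid Composer):

```python
# skos:narrowMatch mappings: each UCB concept covers the local classes
# that the individual sources expose (illustrative names).
NARROW_MATCH = {
    "UCB:Person": ["db1:User", "db2:Person", "db3:Employee", "db3:Contact"],
    "UCB:Antibody": ["db1:Antibody"],
    "UCB:Project": ["db1:Project"],
}

# VoID-style dataset registry: which SPARQL endpoint serves each source.
DATASETS = {
    "db1": "http://example.org/db1/sparql",
    "db2": "http://example.org/db2/sparql",
    "db3": "http://example.org/db3/sparql",
}

def sub_types(ucb_concept):
    """Expand a UCB concept to the source-local classes it covers."""
    return NARROW_MATCH.get(ucb_concept, [])

def endpoints_for(ucb_concept):
    """Which endpoints must be queried to search a UCB concept."""
    prefixes = {local.split(":")[0] for local in sub_types(ucb_concept)}
    return sorted(DATASETS[p] for p in prefixes)
```

This is the piece that "models the models": given only a UCB concept, the framework can work out every local class and endpoint involved, so no application ever has to know where the data lives.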
Sem Int Framework
Step 2: Integration Framework.
[Diagram: the Dataset Ontology (VoID) registers DB1; the UCB Concept Ontology (SKOS) maps UCB:Antibody, UCB:Person and UCB:Project to DB1:Antibody, DB1:User and DB1:Project via skos:narrowMatch]
Step 2: Integration Framework.
[Diagram: the registry extended across sources — UCB:Person has skos:narrowMatch links to DB1:User, DB2:Person, DB3:Employee and DB3:Contact]
Step 2: Integration Framework.
[Diagram: linksets Person_DB1_DB2 and Person_DB1_DB3 record the instance-level links between the datasets in the registry]
Step 2: Integration Framework
Step 3: Rest Services
Rest Services
• Interaction point for applications
• Expose simple and generic access to the integration framework
• Removes the complexity of the framework and of how to ask questions of it
• You don't need to know how to make it work
• You don't need to know anything about the datasets, concepts and properties held within
• Just ask simple questions in the UCB language
• Tell me about UCB:Person "ashworth"
• Built using SPARQLMotion/SPIN and exposed in the TopBraid Live enterprise server
• Two simple yet very effective services created
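The first of the two services, Keyword Search, fans one UCB-language question out across every source. A hedged sketch of the query-building step (the narrowMatch table and class names are hypothetical; the real service would POST each query to the source's SPARQL endpoint rather than just build the strings):

```python
# Illustrative slice of the SKOS registry (not the real UCB mappings).
NARROW_MATCH = {
    "UCB:Person": ["db1:User", "db2:Person", "db3:Employee", "db3:Contact"],
}

def build_queries(ucb_concept, keyword):
    """One SPARQL query per source-local sub-type of the UCB concept:
    find resources of that class whose label contains the keyword."""
    queries = []
    for local in NARROW_MATCH.get(ucb_concept, []):
        queries.append(
            "SELECT ?r WHERE { ?r a %s ; rdfs:label ?l . "
            'FILTER(CONTAINS(LCASE(?l), "%s")) }' % (local, keyword.lower())
        )
    return queries

qs = build_queries("UCB:Person", "phil")
```

The caller only ever says "find UCB:Person 'phil'"; the expansion into four per-source queries happens entirely inside the framework.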
Rest Services
[Diagram: the Keyword Search service — "Find UCB:Person 'phil'": the SKOS ontology gives the sub-types of UCB:Person, the service searches DB1:User, DB2:Person, DB3:Employee and DB3:Contact, and returns the matching resources: ldap:U0xx10x, itrack:101, moc:scordisp etc.]
Step 3: Rest Services
[Diagram: the Get Info service — "Tell me about moc:scordisp": the ontology gives the super-types of the resource, the linksets are checked for additional co-referent resources, DB1:U0xx10x, DB2:scordisp and DB3:philscordis are retrieved from their datasets, and the answer comes back: "Here is everything I know about it"]
Step 3: Rest Services
Tell me everything about this resource?
Rest Services
Data exploration environment
• Search concepts
• Display data
• Allow link following
• Deals with any concept defined in the UCB SKOS language
• Uses the two framework services mentioned previously
• Deployed in TopBraid Ensemble – Live
Step 4: Building an Application 1
Step 4: Data Exploration
UCB Concepts
Search submitted to the "Keyword Search" service
Applications
Step 4: Data Exploration
Results Displayed.
Index shows inference is already taking place
Step 4: Data Exploration
Drag instance to basket; initiates a "Get Info" service call
Step 4: Data Exploration
Select instance; data displayed per source
Step 4: Data Exploration
Links to other data items
Step 4: Data Exploration
Displays sparse data
Submit instance to "Get Info" service
Step 4: Data Exploration
More Detailed Information
Step 4: Data Exploration
He has another interaction.
Let's explore.
Step 4: Data Exploration
Data cached as we navigated the Concept Explorer; it can now be investigated
Step 4: Data Exploration
Structure concept keyword search pulls data from internal and external data sources
Add to basket
After detailed information is retrieved, a second structure is identified without a keyword search
Integrated internal and external data
Step 4: Data Exploration
Federated data gathering & marting
• Data marting without the warehouse
• New Mart REST service
• SPARQLMotion/SPIN scripts
• Dump_UCB:Antibody
• Still uses the framework to integrate data
• On-the-fly data integration
• Gather RDF from data sources
• Dump into tables
• Data consumed by traditional query tools
• Not particularly designed for this aspect… (slow)
• But it works!
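The marting step is essentially a pivot: gathered triples about each subject are flattened into one table row per subject so that traditional query tools can consume them. A minimal sketch with illustrative property names (the real service is a SPARQLMotion/SPIN script writing to relational tables):

```python
# Illustrative gathered triples (subject, property, value) -- in the real
# pipeline these come back from the framework's distributed query.
TRIPLES = [
    ("ab:1", "name", "mAb-1"), ("ab:1", "target", "TNF"),
    ("ab:2", "name", "mAb-2"), ("ab:2", "target", "IL-6"),
]

def triples_to_table(triples):
    """Pivot triples into a table: one column per property, one row per
    subject, empty string where a subject lacks a property."""
    columns = sorted({p for _, p, _ in triples})
    rows = {}
    for s, p, o in triples:
        rows.setdefault(s, {})[p] = o
    table = [["subject"] + columns]
    for s in sorted(rows):
        table.append([s] + [rows[s].get(c, "") for c in columns])
    return table

mart = triples_to_table(TRIPLES)
```

The pivot is what makes the result look like a conventional mart table even though no warehouse was ever built; the slowness noted above comes from doing the gathering on the fly.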
Step 4: Building an Application 2
Knowledge base creation
• Gathering information can be a time-consuming exercise
• But it is vital for projects to have
• Different individuals have different ideas
• Relevance, sources, presentation etc.
• Knowledge base to provide consistency for:
• Data gathered
• Data sources used
• Data presentation
• ROI
• 150-fold increase in efficiency
• 6 mins compared to > 16 hrs (over several weeks)
• Information available to all at a central access point
Step 4: Building an Application 3
Step 4: Knowledge Base
Semantic Integration Framework
Keyword Search Get Info
Data Sources
App Service
"Tell me about the protein with Gene ID X" — and I want to know about literature refs, sequences, descriptions, structure… etc.
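The knowledge-base app service can be thought of as a composition of the two framework services, filtered to the facets the project cares about. A hedged sketch in which both service calls are stubbed with made-up identifiers and values (the real calls are HTTP requests to the REST layer):

```python
def keyword_search(concept, key):
    """Stub for the Keyword Search REST call (illustrative results)."""
    return ["geneid:%s" % key, "uniprot:P0XXXX"]

def get_info(resource):
    """Stub for the Get Info REST call (illustrative facts)."""
    return {
        "geneid:7124": {"description": "TNF"},
        "uniprot:P0XXXX": {"sequence": "MSTESM...", "structure": "1TNF"},
    }.get(resource, {})

# The facets a project wants in every knowledge-base entry.
FACETS = ["description", "sequence", "structure", "literature"]

def knowledge_base_entry(gene_id, facets=FACETS):
    """One question in, one consistent entry out: search for the gene,
    gather info on every hit, keep only the requested facets."""
    entry = {}
    for resource in keyword_search("UCB:Protein", gene_id):
        for k, v in get_info(resource).items():
            if k in facets:
                entry[k] = v
    return entry

kb = knowledge_base_entry("7124")
```

Because every project entry is built by the same composition over the same registry, the consistency of gathered data, sources and presentation comes for free; this is where the 150-fold efficiency gain above is realised.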
Step 4: Knowledge Base
Step 5: Purl Server
Removing URL dependencies
D2R publishes resolvable URLs that are specific to the server
Removing URL specificity with PURL server
Allows each layer of the architecture to be removed without all the others having to be reconfigured
• Level of independence / indirection
Only done on limited scale
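The PURL idea reduces to one level of indirection: applications hold only persistent URLs, and a small redirect table maps each PURL to whichever server currently hosts the resource. A minimal sketch with made-up hostnames (a real PURL server answers with an HTTP 302 rather than a dictionary lookup):

```python
# Illustrative PURL-to-location table; hostnames are hypothetical.
PURL_TABLE = {
    "/purl/person/U0xx10x": "http://d2r-server-01.example/resource/User/U0xx10x",
}

def resolve(purl):
    """Return the current location for a PURL (an HTTP 302 in practice)."""
    return PURL_TABLE.get(purl)

def rehost(old_host, new_host):
    """Move a back-end server: rewrite the table only -- no client,
    application, or stored link needs to change."""
    for purl, url in PURL_TABLE.items():
        PURL_TABLE[purl] = url.replace(old_host, new_host)
```

`rehost` is the payoff: any layer of the architecture can be replaced by editing the table, which is exactly the independence the slide describes.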
Conclusions & Business value
We have built an extensible data integration framework
• Shown how data integration can be an incremental process
• Started with three datasets, more than 20 a few months later
• Compare: the warehouse took 18 months to add two new data sources
• Adding a new source can take less than a day (the whole process, including endpoint creation)
• Creates an enterprise-wide "data fabric" rather than just one more application
• Connect datasets together like web pages fit together
• Literally click from one dataset to the other
• Dynamically mash up data from multiple sources
• Add new sources by describing the connections, not by building a new application
Conclusions & Business value
We have built a framework that
• Differs from data integration applications the way the Web differs from earlier network technologies (FTP, Archie)
• The infrastructure allows new entities (pages, databases) to be added dynamically
• Adding connections is as easy as specifying them
• Provides data for all projects
• Three very different applications have been demonstrated
• All are able to use the same framework
• Reuse
Questions?