SemTech 2011
DESCRIPTION
Phil Ashworth and Dean Allemang, "Building a semantic integration framework to support a federated query environment in 5 steps"
TRANSCRIPT
Building a semantic integration framework to support a federated query environment in 5 steps
Philip Ashworth UCB Celltech
Dean Allemang TopQuadrant
Data Integration… Why?
The scope and knowledge of life sciences expands every day
Every day we make new discoveries by experimenting (in the lab)
Data is generated in the lab in large quantities, complementing the vast growth externally
It is too difficult and time-consuming for the user to bring data together
Therefore we don't often make use of the data we already have to make new discoveries
Data Integration… Problems
[Diagram: the evolution of data-integration approaches — warehouse DB, project DBs, project marts, applications and app DBs, each with its own registration, DI and query layer]
Data Integration… Problems
Demand for DI increases every day
Data doesn't evolve into a larger, more beneficial platform
• Where is the long-term benefit?
• Driving ourselves around in circles
Just creating more data silos
• Limited scope for reuse
Slow and difficult to modify / enhance
High maintenance
• Multiple systems create more and more overhead
Data Integration… Thoughts
Data Integration is clearly evolving
But it is not fulfilling the needs
If we identify the need… can we see what we should be doing?
Accessible Data
True Integration
Variety of Sources
Align Concepts
Data has context
All Data for All Projects
Data Integration… Needs
Open Linked Data Cloud
Connected and linked data with context
Created by a community
Significant linking hubs appearing
Significant scientific content
A valuable resource that will only grow!
Something we can learn from!
Data Integration… There is a way!
Data Integration… Starting an Evolutionary Leap
No one internally really knows about this
Can’t just rip and replace old systems
Have to do some ground work
Linked Data…The Quest
Technology Projects
• Emphasis on semantic web principles
Scientific Projects
• Data Integration
• Data Visualisation (mash-ups)
Linked Data… The Quest
Highly repetitive & promiscuous
New Approach
Develop a POC semantic data integration framework
• Easy to configure
• Support all projects
• Builds an environment for the future
The Idea
[Architecture diagram, bottom to top: Data Sources — RDBMS (Oracle, PostgreSQL, MySQL) exposed as SPARQL endpoints, an RDF triple store with a native SPARQL endpoint, and static files (MS Excel, TXT, DOC) converted to RDF; the Semantic Integration Framework, providing knowledge collation, concept mapping, distributed query, result inference and aggregation; REST services as an abstraction layer; Applications and business process / workflow automation on top, with a PURL server alongside. Moving up the stack, ease of development increases while the knowledge of semantic technologies required decreases.]
Step 1. Data Sources
Expose data as RDF through SPARQL endpoints
Internal data sources
• D2R SPARQL endpoints on RDBMS databases
• Each modelled as the local concepts they represent
• Don't worry about the larger concept picture
• Virtuoso RDF triple store (open source) to host RDF data created from spreadsheets
• TopBraid Ensemble & SPARQLMotion/SPIN scripts to convert static data to RDF
[Diagram: RDBMS sources exposed through D2R SPARQL endpoints; spreadsheet-derived RDF hosted in Virtuoso]
External data sources
• SPARQL endpoints in the LOD cloud from Bio2RDF, LODD and others
• Some stability, access and quality issues within these sources
• Created an Amazon cloud server to host stable environments
• Bio2RDF sources downloaded, stored and modified
• Virtuoso (open source) used as triple store
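Step 1 boils down to turning whatever a source holds into triples. As a minimal sketch of the spreadsheet-to-RDF part (the deck used TopBraid SPARQLMotion/SPIN scripts for this; the namespace and column names below are made up for illustration), each row becomes one subject with one triple per cell:

```python
import csv, io

# Hypothetical internal namespace -- not the real UCB URI scheme.
BASE = "http://example.ucb.internal/resource/"

def rows_to_ntriples(csv_text, subject_col, type_uri):
    """Turn each spreadsheet row into N-Triples lines: one rdf:type
    triple per row plus one literal triple per remaining column."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subj = "<%s%s>" % (BASE, row[subject_col])
        triples.append(
            "%s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <%s> ."
            % (subj, type_uri))
        for col, val in row.items():
            if col != subject_col and val:
                triples.append('%s <%sprop/%s> "%s" .' % (subj, BASE, col, val))
    return triples

# Illustrative data, not real UCB content.
sheet = "id,name,target\nAB001,mAb-1,TNF\nAB002,mAb-2,IL-6\n"
nt = rows_to_ntriples(sheet, "id", BASE + "Antibody")
```

The resulting N-Triples file can be bulk-loaded into Virtuoso, after which the data is queryable alongside the D2R-wrapped databases.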
Step 1. Data Sources
[Diagram: the UCB Data Cloud (IDAC, MOC, PEP, Abysis, NBEMart, SEQ, NBEWH, ITrack, PMT, LDAP, WKW, UCBPDB, Premier) linked as RDF to the Linked Open Data Cloud (Bio2RDF PDB, Sider, Diseasome, KEGG compound / glycan / drug, ChEBI, UniProt EC, GeneID)]
Step 2: Integration Framework
Why?
• Linked Open Data: links within a source are manually created
• To navigate the cloud you either:
• Learn the network
• Discover the network as you go (unguided)
• Nothing understands the total connectivity of the concepts available to you
• Difficult to know where to start
• No idea whether a start point will lead you to the information you are looking for or might be interested in
• Can't query the cloud for specific information
The Integration Framework will resolve these issues• It will model the models to understand the connectivity
You shouldn’t have to know where to look for data
[The architecture diagram again, highlighting the Semantic Integration Framework layer between the data sources and the REST services]
Understand data sources (concepts, access, properties)
Understand links across sources
Automate some tasks
Accessible via services
Easy to wire up
Understand UCB concepts
Understand how UCB concepts fit with source concepts
Step 2: Integration Framework
Step 2: Integration Framework.
Integration Framework
• Data source, concept and property registry
• An ontology that utilises:
• VoID (enhanced) to capture data source information (endpoints)
• SKOS to link local ontologies with UCB concepts
• UCB:Person -> db1:user, db2:employee, db3:actor
Built using the TopBraid Suite
• Ontology development (TopBraid Composer)
• SPARQLMotion scripts to provide some automation
• Creation of ontologies from endpoints, D2R mappings
• Configuration assistance
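The registry can be pictured as two small lookup structures: the SKOS mappings from UCB concepts to source-local classes, and the VoID-style record of which endpoint serves each source. A minimal sketch (all names and endpoint URLs below are illustrative; the real registry is an ontology maintained in TopBraid Composer):

```python
# skos:narrowMatch mappings: each UCB concept covers the local classes
# that the individual sources expose (illustrative names).
NARROW_MATCH = {
    "UCB:Person": ["db1:User", "db2:Person", "db3:Employee", "db3:Contact"],
    "UCB:Antibody": ["db1:Antibody"],
    "UCB:Project": ["db1:Project"],
}

# VoID-style dataset registry: which SPARQL endpoint serves each source.
DATASETS = {
    "db1": "http://example.org/db1/sparql",
    "db2": "http://example.org/db2/sparql",
    "db3": "http://example.org/db3/sparql",
}

def sub_types(ucb_concept):
    """Expand a UCB concept to the source-local classes it covers."""
    return NARROW_MATCH.get(ucb_concept, [])

def endpoints_for(ucb_concept):
    """Which endpoints must be queried to search a UCB concept."""
    prefixes = {local.split(":")[0] for local in sub_types(ucb_concept)}
    return sorted(DATASETS[p] for p in prefixes)
```

This is the piece that "models the models": given only a UCB concept, the framework can work out every local class and endpoint involved, so no application ever has to know where the data lives.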
Sem Int Framework
Step 2: Integration Framework.
[Diagram: the Dataset Ontology (VoID) registers DB1; the UCB Concept Ontology (SKOS) maps UCB:Antibody, UCB:Person and UCB:Project to DB1:Antibody, DB1:User and DB1:Project via skos:narrowMatch]
Step 2: Integration Framework.
[Diagram: the registry extended across sources — UCB:Person has skos:narrowMatch links to DB1:User, DB2:Person, DB3:Employee and DB3:Contact]
Step 2: Integration Framework.
[Diagram: linksets Person_DB1_DB2 and Person_DB1_DB3 record the instance-level links between the datasets in the registry]
Step 2: Integration Framework
Step 3: Rest Services
Rest Services
• Interaction point for applications
• Expose simple and generic access to the integration framework
• Removes the complexity of the framework and of how to ask questions of it
• You don't need to know how to make it work
• You don't need to know anything about the datasets, concepts and properties held within
• Just ask simple questions in the UCB language
• Tell me about UCB:Person "ashworth"
• Built using SPARQLMotion/SPIN and exposed in the TopBraid Live enterprise server
• Two simple yet very effective services created
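The first of the two services, Keyword Search, fans one UCB-language question out across every source. A hedged sketch of the query-building step (the narrowMatch table and class names are hypothetical; the real service would POST each query to the source's SPARQL endpoint rather than just build the strings):

```python
# Illustrative slice of the SKOS registry (not the real UCB mappings).
NARROW_MATCH = {
    "UCB:Person": ["db1:User", "db2:Person", "db3:Employee", "db3:Contact"],
}

def build_queries(ucb_concept, keyword):
    """One SPARQL query per source-local sub-type of the UCB concept:
    find resources of that class whose label contains the keyword."""
    queries = []
    for local in NARROW_MATCH.get(ucb_concept, []):
        queries.append(
            "SELECT ?r WHERE { ?r a %s ; rdfs:label ?l . "
            'FILTER(CONTAINS(LCASE(?l), "%s")) }' % (local, keyword.lower())
        )
    return queries

qs = build_queries("UCB:Person", "phil")
```

The caller only ever says "find UCB:Person 'phil'"; the expansion into four per-source queries happens entirely inside the framework.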
Rest Services
[Diagram: the Keyword Search service — "Find UCB:Person 'phil'": the SKOS ontology gives the sub-types of UCB:Person, the service searches DB1:User, DB2:Person, DB3:Employee and DB3:Contact, and returns the matching resources: ldap:U0xx10x, itrack:101, moc:scordisp etc.]
Step 3: Rest Services
[Diagram: the Get Info service — "Tell me about moc:scordisp": the ontology gives the super-types of the resource, the linksets are checked for additional co-referent resources, DB1:U0xx10x, DB2:scordisp and DB3:philscordis are retrieved from their datasets, and the answer comes back: "Here is everything I know about it"]
Step 3: Rest Services
Tell me everything about this resource?
Rest Services
Data exploration environment
• Search concepts
• Display data
• Allow link following
• Deals with any concept defined in the UCB SKOS language
• Uses the two framework services mentioned previously
• Deployed in TopBraid Ensemble – Live
Step 4: Building an Application 1
Step 4: Data Exploration
UCB Concepts
Search submitted to the "Keyword Search" service
Applications
Step 4: Data Exploration
Results Displayed.
Index shows inference is already taking place
Step 4: Data Exploration
Drag instance to basket; initiates a "Get Info" service call
Step 4: Data Exploration
Select instance; data displayed per source
Step 4: Data Exploration
Links to other data items
Step 4: Data Exploration
Displays sparse data
Submit instance to "Get Info" service
Step 4: Data Exploration
More Detailed Information
Step 4: Data Exploration
He has another interaction.
Let's explore.
Step 4: Data Exploration
Data cached as we navigated the Concept Explorer; it can now be investigated
Step 4: Data Exploration
Structure concept keyword search pulls data from internal and external data sources
Add to basket
After detailed information is retrieved, a second structure is identified without a keyword search
Integrated internal and external data
Step 4: Data Exploration
Federated data gathering & marting
• Data marting without the warehouse
• New Mart REST service
• SPARQLMotion/SPIN scripts
• Dump_UCB:Antibody
• Still uses the framework to integrate data
• On-the-fly data integration
• Gather RDF from data sources
• Dump into tables
• Data consumed by traditional query tools
• Not particularly designed for this aspect… (slow)
• But it works!
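The marting step is essentially a pivot: gathered triples about each subject are flattened into one table row per subject so that traditional query tools can consume them. A minimal sketch with illustrative property names (the real service is a SPARQLMotion/SPIN script writing to relational tables):

```python
# Illustrative gathered triples (subject, property, value) -- in the real
# pipeline these come back from the framework's distributed query.
TRIPLES = [
    ("ab:1", "name", "mAb-1"), ("ab:1", "target", "TNF"),
    ("ab:2", "name", "mAb-2"), ("ab:2", "target", "IL-6"),
]

def triples_to_table(triples):
    """Pivot triples into a table: one column per property, one row per
    subject, empty string where a subject lacks a property."""
    columns = sorted({p for _, p, _ in triples})
    rows = {}
    for s, p, o in triples:
        rows.setdefault(s, {})[p] = o
    table = [["subject"] + columns]
    for s in sorted(rows):
        table.append([s] + [rows[s].get(c, "") for c in columns])
    return table

mart = triples_to_table(TRIPLES)
```

The pivot is what makes the result look like a conventional mart table even though no warehouse was ever built; the slowness noted above comes from doing the gathering on the fly.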
Step 4: Building an Application 2
Knowledge base creation
• Gathering information can be a time-consuming exercise
• But it is vital for projects to have
• Different individuals have different ideas
• Relevance, sources, presentation etc.
• Knowledge base to provide consistency for:
• Data gathered
• Data sources used
• Data presentation
• ROI
• 150-fold increase in efficiency
• 6 mins compared to > 16 hrs (over several weeks)
• Information available to all at a central access point
Step 4: Building an Application 3
Step 4: Knowledge Base
Semantic Integration Framework
Keyword Search Get Info
Data Sources
App Service
"Tell me about the protein with Gene ID X" — and I want to know about literature refs, sequences, descriptions, structure… etc.
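The knowledge-base app service can be thought of as a composition of the two framework services, filtered to the facets the project cares about. A hedged sketch in which both service calls are stubbed with made-up identifiers and values (the real calls are HTTP requests to the REST layer):

```python
def keyword_search(concept, key):
    """Stub for the Keyword Search REST call (illustrative results)."""
    return ["geneid:%s" % key, "uniprot:P0XXXX"]

def get_info(resource):
    """Stub for the Get Info REST call (illustrative facts)."""
    return {
        "geneid:7124": {"description": "TNF"},
        "uniprot:P0XXXX": {"sequence": "MSTESM...", "structure": "1TNF"},
    }.get(resource, {})

# The facets a project wants in every knowledge-base entry.
FACETS = ["description", "sequence", "structure", "literature"]

def knowledge_base_entry(gene_id, facets=FACETS):
    """One question in, one consistent entry out: search for the gene,
    gather info on every hit, keep only the requested facets."""
    entry = {}
    for resource in keyword_search("UCB:Protein", gene_id):
        for k, v in get_info(resource).items():
            if k in facets:
                entry[k] = v
    return entry

kb = knowledge_base_entry("7124")
```

Because every project entry is built by the same composition over the same registry, the consistency of gathered data, sources and presentation comes for free; this is where the 150-fold efficiency gain above is realised.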
Step 4: Knowledge Base
Step 5: Purl Server
Removing URL dependencies
D2R publishes resolvable URLs that are specific to the server
Removing URL specificity with PURL server
Allows each layer of the architecture to be removed without all the others having to be reconfigured
• Level of independence / indirection
Only done on limited scale
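The PURL idea reduces to one level of indirection: applications hold only persistent URLs, and a small redirect table maps each PURL to whichever server currently hosts the resource. A minimal sketch with made-up hostnames (a real PURL server answers with an HTTP 302 rather than a dictionary lookup):

```python
# Illustrative PURL-to-location table; hostnames are hypothetical.
PURL_TABLE = {
    "/purl/person/U0xx10x": "http://d2r-server-01.example/resource/User/U0xx10x",
}

def resolve(purl):
    """Return the current location for a PURL (an HTTP 302 in practice)."""
    return PURL_TABLE.get(purl)

def rehost(old_host, new_host):
    """Move a back-end server: rewrite the table only -- no client,
    application, or stored link needs to change."""
    for purl, url in PURL_TABLE.items():
        PURL_TABLE[purl] = url.replace(old_host, new_host)
```

`rehost` is the payoff: any layer of the architecture can be replaced by editing the table, which is exactly the independence the slide describes.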
Conclusions & Business value
We have built an extensible data integration framework
• Shown how data integration can be an incremental process
• Started with three datasets, more than 20 a few months later
• Compare: the warehouse took 18 months to add two new data sources
• Adding a new source can take less than a day (the whole process, including endpoint creation)
• Creates an enterprise-wide "data fabric" rather than just one more application
• Connect datasets together like web pages fit together
• Literally click from one dataset to the other
• Dynamically mash up data from multiple sources
• Add new sources by describing the connections, not by building a new application
Conclusions & Business value
We have built a framework that
• Differs from data integration applications the way the Web differs from earlier network technologies (FTP, Archie)
• The infrastructure allows new entities (pages, databases) to be added dynamically
• Adding connections is as easy as specifying them
• Provides data for all projects
• Three very different applications have been demonstrated
• All are able to use the same framework
• Reuse
Questions?