the scientific method on the semantic web
DESCRIPTION
Presentation to the iCAPTURE Center, Heart + Lung Institute at St. Paul's Hospital
TRANSCRIPT
SADI, SHARE and the Scientific Method
The Quest for the Holy Grail
The Problem
The Holy Grail: (this slide created circa 2002)
Align the promoters of all serine threonine kinases involved exclusively in the regulation of cell sorting during wound healing in blood vessels.
Retrieve and align 2000nt 5' from every serine/threonine kinase in Mus musculus expressed exclusively in the tunica [I | M |A] whose expression increases 5X or more within 5 hours of wounding but is not activated during the normal development of blood vessels, and is <40% homologous in the active site to kinases known to be involved in cell-cycle regulation in any other species.
Two novel technologies
developed in our lab
are getting us very close to the Holy Grail!
Holy Grail Demo #1
Imagine there is a “virtual database” containing all of the data from all of the databases, together with the output of every conceivable analysis
How do we query that database?
A Brief Digression…
“Database”
?
Boxes became ovals…
Straight lines became curvy lines…
…and you want us to give you a grant for THAT??
Relational Database
“Graph”
Protein Table: Protein Index | Protein Name | Regulates ID
Gene Table: Gene ID | Tissue ID | Type ID
http://pdb.org/114487 isRepressorOf http://ncbi.nlm/NR/NR_14487
“Foreign keys” are used to link tables in a database
Links in Graphs consist of statements called
“TRIPLES”
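As a rough illustration of the contrast, the same link can be written either as a foreign key or as a triple. This is only a sketch: the row layout, the placeholder values, and the helper names `follow_foreign_key` and `objects_of` are assumptions built from the labels on the slide, not code from SADI or SHARE.

```python
# Relational style: the link is a bare key. What "regulates_id" means is
# known only from the schema (and the database administrator).
protein_table = [
    {"protein_index": 114487, "protein_name": "(placeholder)", "regulates_id": 14487},
]
gene_table = [
    {"gene_id": 14487, "tissue_id": None, "type_id": None},
]

# Graph style: the link is a (subject, predicate, object) statement, a "triple".
# The predicate names the relationship explicitly.
triples = [
    ("http://pdb.org/114487", "isRepressorOf", "http://ncbi.nlm/NR/NR_14487"),
]


def follow_foreign_key(protein_row, genes):
    """Join a protein row to the gene rows its foreign key points at."""
    return [g for g in genes if g["gene_id"] == protein_row["regulates_id"]]


def objects_of(subject, predicate, graph):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]


joined = follow_foreign_key(protein_table[0], gene_table)
linked = objects_of("http://pdb.org/114487", "isRepressorOf", triples)
```

In the relational version the reader has to know what `regulates_id` means; in the triple the relationship is carried by the statement itself.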
Both Data Sources are on the Same Machine
Graph Data Sources (may be) on Independent Machines on the Web
“Meaning” of the connection between data-points is understood only by the database administrator
Protein regulates
Gene
“Meaning” of the connection in a Graph is explicitly labeled (and machine-readable!)
Connect all of the graphs in the world to one another
And what do you get?
Mark Butler (2003) Is the semantic web hype? Hewlett Packard laboratories presentation at MMU, 2003-03-12
The lavender portion represents biology – currently ~40,000,000,000 Triples (we and our collaborators will be doubling that number in the next 12 months)
How do you find information on this
“Semantic Web”
??
SPARQL
The query language used to discover and extract information represented in Graphs
SPARQL
Unfortunately, YOU have to know which Web resources contain which Triples
(HARD!)
Even if you do know this, SPARQL has significant limitations when attempting to query over disparate Graphs (SLOW AND CUMBERSOME)
SPARQL
If the data doesn’t exist in any Graph at all…
Basically…
A novel way of making Triples available on the Semantic Web, using a technology called Web Services
“Services” for short
Basically…
We invented SADI to overcome some/all of these problems
…but I won’t bore you with the technical details…
Detour Ends
Please resume speed
Imagine there is a “virtual database” containing all of the data from all of the databases, together with the output of every conceivable analysis
Holy Grail Demo #1
How do we query that database?
SHARE: Semantic Health And Research Environment
SPARQL enhanced by SADI
A Novel SPARQL Query Engine
Overcomes some of the limitations of traditional SPARQL query-handlers
…and more…
MUCH more!!
What pathways does UniProt protein P47989 belong to?
PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX ont: <http://ontology.dumontierlab.com/>
PREFIX uniprot: <http://lsrn.org/UniProt:>
SELECT ?gene ?pathway
WHERE {
    uniprot:P47989 pred:isEncodedBy ?gene .
    ?gene ont:isParticipantIn ?pathway .
}
Note that there is no “From” clause… I have neglected to tell the system where to look for the answer; I am simply asking my question
Now stick that query into SHARE
Recap: what we just saw
A standard SPARQL query was entered into SHARE, a SADI-aware query engine
The query was interpreted to extract the individual data/relationships being requested (and any component/sub-properties, as we shall see later!)
The “triple-patterns” required to answer the query are passed to SADI for Web Service discovery
Services capable of generating those triple-patterns are automatically executed, the triples are stored, and the query is resolved.
We posed, and answered, a fairly complex database query
WITHOUT A DATABASE
(in fact, the data didn’t even have to exist...)
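The recap above can be sketched as a toy resolver. This is a minimal illustration under loud assumptions, not SHARE's actual implementation: the `REGISTRY` dictionary stands in for SADI service discovery, the two "services" are local functions returning made-up placeholder values, and each pattern is assumed to have a constant or already-bound subject and a variable object.

```python
# Hypothetical stand-ins for Web Services: each one generates triples on demand
# for a given subject. The returned identifiers are placeholders, not real data.
def encoded_by_service(subject):
    return [(subject, "pred:isEncodedBy", "gene:EXAMPLE")]


def pathway_service(subject):
    return [(subject, "ont:isParticipantIn", "pathway:EXAMPLE")]


# Stand-in for service discovery: which service can generate which predicate.
REGISTRY = {
    "pred:isEncodedBy": encoded_by_service,
    "ont:isParticipantIn": pathway_service,
}


def resolve(patterns):
    """Resolve triple patterns in order, calling one service per pattern."""
    store = []        # every triple generated along the way
    solutions = [{}]  # variable bindings found so far
    for subj, pred, obj in patterns:
        service = REGISTRY[pred]  # "discover" a service for this predicate
        next_solutions = []
        for binding in solutions:
            subject = binding.get(subj, subj)  # substitute a bound variable
            for triple in service(subject):    # execute the service
                store.append(triple)           # store the generated triple
                extended = dict(binding)
                extended[obj] = triple[2]      # bind the object variable
                next_solutions.append(extended)
        solutions = next_solutions
    return solutions, store


query = [
    ("uniprot:P47989", "pred:isEncodedBy", "?gene"),
    ("?gene", "ont:isParticipantIn", "?pathway"),
]
solutions, store = resolve(query)
```

Note that `store` starts empty: the triples that answer the query come into existence only because the query asked for them.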
Holy Grail Demo #2
Show me the latest Blood Urea Nitrogen and Creatinine levels of patients who appear to be rejecting their transplants
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>
SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
    ?patient rdf:type patient:LikelyRejecter .
    ?patient l:latestBUN ?bun .
    ?patient l:latestCreatinine ?creat .
}
Likely Rejecter:
A patient who has creatinine levels that are increasing over time
- - Wilkinson MD
Likely Rejecter:
…but there is no “likely rejecter” column or table in our database…
only blood chemistry measurements at various time-points
?
The definition of a LikelyRejecter is encoded in a machine-readable document written in the OWL language (“Ontology”)
“the regression line over creatinine measurements should have an increasing slope”
The machine continues to burrow down through the definition and discovers that regression lines have things like slopes and intercepts, etc…
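The quoted definition can be sketched in a few lines. This is an assumption-laden illustration, not the OWL class or the analytical service used in the demo: the function names and the (day, creatinine) readings are made up, and an ordinary least-squares fit is assumed as the regression method. It needs at least two distinct time-points.

```python
def regression_slope(points):
    """Ordinary least-squares slope of y over x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


def is_likely_rejecter(creatinine_series):
    """True when the regression line over the measurements has an increasing slope."""
    return regression_slope(creatinine_series) > 0


# Made-up (day, creatinine) readings for two hypothetical patients.
rising = [(0, 1.0), (1, 1.5), (2, 2.0)]
falling = [(0, 2.0), (1, 1.5), (2, 1.0)]

rising_flag = is_likely_rejecter(rising)
falling_flag = is_likely_rejecter(falling)
```

Nothing in the input says "rejecter"; the classification falls out of the slope of the fitted line.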
Then…
Two magical events occur…
The machine figures out, by itself, the need to do a Linear Regression analysis in order to answer your question
The machine figures out, by itself, how and where that analysis can be done, and does it automatically!
http://www.impactlab.net/2009/03/22/improve-your-brain-power/
The SHARE system utilizes SADI to discover analytical services on the Web that do linear regression analysis
VOILA!
How do we do that?!?
We let the data describe itself!
This is different from most of the bioinformatics world, where the person giving you the data also tells you how to interpret it
Data exhibits “late binding”
Late binding:
“purpose and meaning” of the data is not determined until the moment it is required
Benefit of late binding
Data is amenable to constant re-interpretation
Example?
Blood Creatinine measurements were not dictated to be (only) Blood Creatinine measurements!
The data had the ‘qualities/properties’ that allowed the machine to infer that they were Blood Creatinine measurements
But the data also had the ‘qualities/properties’ that allowed them to be interpreted as X/Y coordinate data by another Service
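A small sketch of that idea, with loud assumptions: the record fields and the two interpreter functions are invented for illustration and do not reflect how SADI services actually describe their inputs. The point is only that each consumer decides what the data is at the moment it needs it, based on the properties the data carries.

```python
# Made-up records. Nothing here declares "I am a creatinine measurement";
# the records only carry properties.
records = [
    {"analyte": "creatinine", "specimen": "blood", "time": 0, "value": 1.0},
    {"analyte": "creatinine", "specimen": "blood", "time": 5, "value": 1.4},
]


def as_blood_creatinine(record):
    """One consumer's reading: a blood creatinine measurement, if the properties fit."""
    if record.get("analyte") == "creatinine" and record.get("specimen") == "blood":
        return {"creatinine": record["value"], "at": record["time"]}
    return None


def as_xy_point(record):
    """Another consumer's reading: anything with a time and a numeric value is an (x, y) point."""
    if "time" in record and isinstance(record.get("value"), (int, float)):
        return (record["time"], record["value"])
    return None


# The same records, bound to two different meanings only when each is required.
measurements = [as_blood_creatinine(r) for r in records]
points = [as_xy_point(r) for r in records]
```

A regression service only needs the second reading; it never has to know the points are clinical measurements.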
http://www.flickr.com/people/faernworks/
The Holy Grail may not yet be in-hand but we can at least see it from here!
So… now what?
Mark’s Manifesto
What is my next “Holy Grail”?
Science
Support for the in silico Scientific Method
Reproducibility
Clarity (hypothesis)
Discourse
Disagreement
Clarity (experiment)
The Scientific Method
Discourse: What do you believe? What do I believe?
Disagreement: You’re wrong! And I’m gonna prove it!
Clarity: This is the experiment I am going to do
Reproducibility: This is how I did it (“provenance”)
Clarity: This is my new hypothesis
Workflows (e.g. myExperiment)
Another Brief Digression…
“Facebook” for Scientists
http://myexperiment.org
An exciting evolution in the way Researchers express and share
their in silico “Materials and Methods”
Through things called ‘Workflows’
Workflows are explicit representations of the method by which an analysis was done and which resources are used to do it
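To make "explicit representation" concrete, here is a toy sketch. It is not the Taverna or myExperiment workflow format; the two steps, the `WORKFLOW` list, and the `run` function are assumptions chosen only to show a method written down as data, executed, and recorded.

```python
def reverse_complement(seq):
    """Step 1: reverse-complement a DNA sequence."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


def gc_content(seq):
    """Step 2: fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# The workflow is data: an ordered list of (step name, resource used).
WORKFLOW = [
    ("reverse_complement", reverse_complement),
    ("gc_content", gc_content),
]


def run(workflow, data):
    """Run each step in order and record what was done (the provenance)."""
    provenance = []
    for name, step in workflow:
        result = step(data)
        provenance.append({"step": name, "input": data, "output": result})
        data = result
    return data, provenance


result, provenance = run(WORKFLOW, "AACG")
```

Because the method is a list rather than a sequence of manual clicks, anyone can rerun it on the same input or inspect `provenance` to see exactly how a result was produced.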
Workflows can be very simple…
“Blast this sequence”
Or not...
This workflow takes in a CEL file and a normalisation method, then returns a series of images/graphs which represent the same output obtained using the MADAT software package (MicroArray Data Analysis Tool).
Also returned by this workflow is a list of the top differentially expressed genes (size dependent on the number specified as input, geneNumber), which are then used to find the candidate pathways which may be influencing the observed changes in the microarray data.
Why bother?
A workbench for designing and executing Scientific Workflows
Taverna
Load-up your data and press “play”!
…Then go home for the weekend! You are just one click away from your M.Sc.!!
By the by…
The SHARE application automatically creates a Workflow and then automatically runs it.
This is where the data comes from to answer the queries…
Workflows are a Good Thing™
Detour Ends
Please resume speed
[Diagram: the Scientific Method cycle (Reproducibility, Clarity (hypothesis), Discourse, Disagreement, Clarity (experiment)), with WORKFLOWS marked against Reproducibility]
At the moment the Semantic Web in Healthcare and Life Sciences addresses these issues by attempting to create “consensus”
Large, centralized ontologies (e.g. the Gene Ontology) that claim to represent community agreement about “biological reality”
…is that Science?
[Diagram: the Scientific Method cycle, rebuilt step by step: Discourse is replaced by Ontology Consortia, Disagreement by Consensus, and Clarity (hypothesis) by ????]
To restore the “traditions of Science”
to in silico science
The Semantic Web needs to encourage/facilitate
personal opinion and debate
What has this got to do with SADI and SHARE?
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>
SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
    ?patient rdf:type patient:LikelyRejecter .
    ?patient l:latestBUN ?bun .
    ?patient l:latestCreatinine ?creat .
}
Likely Rejecter
I created a small ontology describing my definition of
a Likely Rejecter
… it was MY ontology!
I can re-use it
I can modify it as I change my world-view
I can publish it for others to use
Others can modify it and/or compare it to THEIR world-view
Sharing my ontology gives opportunities for “micro-attribution”
“Credit” to me is automatic when someone uses my ontology in their ontology/query
Using SADI and SHARE my personal world-view is explicitly expressed and can be dynamically evaluated against global data and knowledge
http://www.dailymail.co.uk/femail/article-488234/Friends-dignity-self-respect---weight-wasnt-I-lost-slimming-club.html
…but there’s more…
“Likely Rejecter”
I made that up! It came out of my head!
What’s another word for a world-view that you make-up?
Hypothesis
The “Likely Rejecter” OWL Class is an explicitly-expressed hypothesis;
Members of that class may or may not exist!
[Diagram: the Scientific Method cycle, now labelled Reproducibility, Hypotheses, Discourse, Disagreement, Experiment]
Ontologically-expressed Hypotheses drive the discovery, assembly, and analysis of data capable of evaluating their validity
[Diagram labels: Hypothesis (Blood Pressure, Hypertension, Ischemia); SADI + SHARE; Database 1; Database 2; Analytical Algorithm]
Join us!
SADI and CardioSHARE are Open-Source projects
Come join us – we’re having a lot of fun!!
http://sadiframework.org
Credits
Benjamin VanderValk (SHARE & SADI)
Luke McCarthy (SADI, SHARE, Taverna, CardioSHARE)
Soroush Samadian (CardioSHARE)
David Withers (Taverna)
Edward Kawas (SADI Service auto-generator)
U of New Brunswick
Dr. Chris Baker
Alexandre Riazanov
Carleton University
Dr. Michel Dumontier
Marc-Alexandre Nolin
Leonid Chepelev
Steve Etlinger
Nichaella Kieth
Jose Cruz
Microsoft Research
Fin
This presentation available on SlideShare: keywords ‘wilkinson’ ‘iCAPTURE’ ‘HLI’