technologies, methods and challenges to data sharing and aggrigation

90
Technologies, methods and challenges to effective public data sharing and aggregation Mark D Wilkinson Medical Genetics, UBC PI Bioinformatics Heart + Lung Institute at St. Paul’s Hospital [email protected] http://wilkinsonlab.ca

Upload: mark-wilkinson

Post on 18-Dec-2014

416 views

Category:

Technology


2 download

DESCRIPTION

Presentation to the "Pest Practices in Personalized Medicine" (B2PM) Conference, Vancouver, BC, Canada. March, 2011.

TRANSCRIPT

Page 1: Technologies, methods and challenges to data sharing and aggrigation

Technologies, methods and challengesto effective public data sharing and aggregation

Mark D WilkinsonMedical Genetics, UBC

PI BioinformaticsHeart + Lung Institute at St. Paul’s Hospital

[email protected]://wilkinsonlab.ca

Page 2: Technologies, methods and challenges to data sharing and aggrigation

Thanks in advance(very few of these ideas are my own!)

Carole Goble – University of Manchester

Paul Gordon – Sun Center of Excellence, U Calgary

Charles Petrie – Stanford University

Page 3: Technologies, methods and challenges to data sharing and aggrigation

We’ve come a long way!!

Page 4: Technologies, methods and challenges to data sharing and aggrigation

In the beginning...

Page 5: Technologies, methods and challenges to data sharing and aggrigation

Link Integration

Page 6: Technologies, methods and challenges to data sharing and aggrigation
Page 7: Technologies, methods and challenges to data sharing and aggrigation

Integration of What?

Page 8: Technologies, methods and challenges to data sharing and aggrigation

Specially-formatted Files

EMBL Format

Page 9: Technologies, methods and challenges to data sharing and aggrigation

Specially-formatted Files

GenBankFormat

Page 10: Technologies, methods and challenges to data sharing and aggrigation

Specially-formatted Files

FASTA Format

Page 11: Technologies, methods and challenges to data sharing and aggrigation

Specially-formatted Files

GCG Format

Page 12: Technologies, methods and challenges to data sharing and aggrigation

At least 20 different formats for representing DNA sequences...

Lord et al. 2004

Page 13: Technologies, methods and challenges to data sharing and aggrigation

Many formats contained a wide variety of related,

but different, information

DNA sequenceSequence features

TranslationDate/time/method

Publication cross-references......

Page 14: Technologies, methods and challenges to data sharing and aggrigation

Each file format required it’s own parser...

Page 15: Technologies, methods and challenges to data sharing and aggrigation

...and that problem wasn’t limited to DNA...

Page 16: Technologies, methods and challenges to data sharing and aggrigation

XML

Page 17: Technologies, methods and challenges to data sharing and aggrigation

What did XML do for us?

“...advent of XML meant that we didn’t have to write our own parsers anymore...”

Individual data elements in a file can be automatically located and extracted

Page 18: Technologies, methods and challenges to data sharing and aggrigation

Predictable way to represent data

Makes it easier for machines to encode/extract

Page 19: Technologies, methods and challenges to data sharing and aggrigation

EMBL Record for BRCA1In XML

Page 20: Technologies, methods and challenges to data sharing and aggrigation

GenBank Record for BRCA1In XML

Page 21: Technologies, methods and challenges to data sharing and aggrigation

So now we can share and aggregate data!

Page 22: Technologies, methods and challenges to data sharing and aggrigation

So now we can share and aggregate data!

Page 23: Technologies, methods and challenges to data sharing and aggrigation

EMBL Record for BRCA1In XML

GenBank Record for BRCA1In XML

Page 24: Technologies, methods and challenges to data sharing and aggrigation

So now we can share and aggregate data!

...because it isn’t (just) a parsing problem...

Various resources have various data models

Page 25: Technologies, methods and challenges to data sharing and aggrigation

So... Let’s find a way to describe the data models!

Page 26: Technologies, methods and challenges to data sharing and aggrigation

XML Schema

Page 27: Technologies, methods and challenges to data sharing and aggrigation

XML SchemaThere will be an element called “GBQualifier”There will be a child element called “GBQualifier_name”The content of that child element will be free-textThere will be a child element called “GBQualifier_value”The content of that child element will be free-text

XML SchemaThere will be an element called “qualifier”It will have an attribute called “name”The content of that attribute will be textThere will be a child element called “value”The content of that child element will be free-text

Page 28: Technologies, methods and challenges to data sharing and aggrigation

So now we can share and aggregate data!

Page 29: Technologies, methods and challenges to data sharing and aggrigation

So now we can share and aggregate data!

Page 30: Technologies, methods and challenges to data sharing and aggrigation

What did XML Schema do for us?

“...XML Schema (among other things) allowed us to ~automate the creation of (in-memory) Structures which could hold the given XML-formatted data...”

Page 31: Technologies, methods and challenges to data sharing and aggrigation

Does not solve the integration or aggregation problem

Page 32: Technologies, methods and challenges to data sharing and aggrigation

XML SchemaThere will be an element called “GBQualifier”There will be a child element called “GBQualifier_name”The content of that child element will be free-textThere will be a child element called “GBQualifier_value”The content of that child element will be free-text

XML SchemaThere will be an element called “qualifier”It will have an attribute called “name”The content of that attribute will be textThere will be a child element called “value”The content of that child element will be free-text

Page 33: Technologies, methods and challenges to data sharing and aggrigation

XML SchemaThere will be an element called “GBQualifier”There will be a child element called “GBQualifier_name”The content of that child element will be free-textThere will be a child element called “GBQualifier_value”The content of that child element will be free-text

XML SchemaThere will be an element called “qualifier”It will have an attribute called “name”The content of that attribute will be textThere will be a child element called “value”The content of that child element will be free-text

Because the “meaning” of eachelement is implicit, we resort to

“Schema Mapping” to integrate the data

Page 34: Technologies, methods and challenges to data sharing and aggrigation

XML SchemaThere will be an element called “GBQualifier”There will be a child element called “GBQualifier_name”The content of that child element will be free-textThere will be a child element called “GBQualifier_value”The content of that child element will be free-text

XML SchemaThere will be an element called “qualifier”It will have an attribute called “name”The content of that attribute will be textThere will be a child element called “value”The content of that child element will be free-text

Page 35: Technologies, methods and challenges to data sharing and aggrigation

XML SchemaThere will be an element called “GBQualifier”There will be a child element called “GBQualifier_name”The content of that child element will be free-textThere will be a child element called “GBQualifier_value”The content of that child element will be free-text

XML SchemaThere will be an element called “qualifier”It will have an attribute called “name”The content of that attribute will be textThere will be a child element called “value”The content of that child element will be free-text

Page 36: Technologies, methods and challenges to data sharing and aggrigation

XML SchemaThere will be an element called “GBQualifier”There will be a child element called “GBQualifier_name”The content of that child element will be free-textThere will be a child element called “GBQualifier_value”The content of that child element will be free-text

XML SchemaThere will be an element called “qualifier”It will have an attribute called “name”The content of that attribute will be textThere will be a child element called “value”The content of that child element will be free-text

AutomaticallyMap Schema, then

we can integrate data!

Page 37: Technologies, methods and challenges to data sharing and aggrigation

XML SchemaThere will be an element called “GBQualifier”There will be a child element called “GBQualifier_name”The content of that child element will be free-textThere will be a child element called “GBQualifier_value”The content of that child element will be free-text

XML SchemaThere will be an element called “qualifier”It will have an attribute called “name”The content of that attribute will be textThere will be a child element called “value”The content of that child element will be free-text

Page 38: Technologies, methods and challenges to data sharing and aggrigation

Nevertheless...

Page 39: Technologies, methods and challenges to data sharing and aggrigation

Web Services

“Service Oriented Architectures”

WSDL (and many other 4-letter words)

Page 40: Technologies, methods and challenges to data sharing and aggrigation

Web Services & SOA’s

Allow you to expose software (e.g. a database, analytical tool, or service)

on the Webso that others can use it

(in their own analytical pipelines)

Page 41: Technologies, methods and challenges to data sharing and aggrigation

Excellent!!

Page 42: Technologies, methods and challenges to data sharing and aggrigation

But...

Page 43: Technologies, methods and challenges to data sharing and aggrigation

XML Schema

Page 44: Technologies, methods and challenges to data sharing and aggrigation

XML SchemaThere will be an element called “GBQualifier”There will be a child attribute called “GBQualifier_name”The content of that child attribute will be free-textThere will be a child attribute called “GBQualifier_value”The content of that child attribute will be free-text

XML SchemaThere will be an element called “qualifier”It will have an attribute called “name”The content of that attribute will be textThere will be a child attribute called “value”The content of that child attribute will be free-text

Page 45: Technologies, methods and challenges to data sharing and aggrigation

“The phrase ‘practical Web Services’ is not intrinsically an oxymoron, but [I] argue that there are few in existence.”

Page 46: Technologies, methods and challenges to data sharing and aggrigation

Why?

Page 47: Technologies, methods and challenges to data sharing and aggrigation

XML SchemaThere will be an element called “GBQualifier”There will be a child attribute called “GBQualifier_name”The content of that child attribute will be free-textThere will be a child attribute called “GBQualifier_value”The content of that child attribute will be free-text

XML SchemaThere will be an element called “qualifier”It will have an attribute called “name”The content of that attribute will be textThere will be a child attribute called “value”The content of that child attribute will be free-text

Because this problem is so disruptivethat there is little point in building

“public” Web Services...

They are simply too difficult to integrate with other “public” Web Services.

-- adapted from Petrie, SWSIP 2009

Page 48: Technologies, methods and challenges to data sharing and aggrigation

...and that’s pretty muchwhere the world is right now...

Page 49: Technologies, methods and challenges to data sharing and aggrigation

But there is hope!

Page 50: Technologies, methods and challenges to data sharing and aggrigation

“Linked Data” movement

Resource Description Framework“RDF”

The “Semantic Web” movement

Web Ontology Language“OWL” (+ RDF)

Two new technologies & communities

Page 51: Technologies, methods and challenges to data sharing and aggrigation

What does RDF do for us?

“...RDF replaces XML Schema, because RDF says that there is only one data model...”

Page 52: Technologies, methods and challenges to data sharing and aggrigation

What does OWL do for us?

“...the semantics are no longer implicit in the data model...”

XML SchemaThere will be an element called “GBQualifier”There will be a child attribute called “GBQualifier_name”The content of that child attribute will be free-textThere will be a child attribute called “GBQualifier_value”The content of that child attribute will be free-text

XML SchemaThere will be an element called “qualifier”It will have an attribute called “name”The content of that attribute will be textThere will be a child attribute called “value”The content of that child attribute will be free-text

Page 53: Technologies, methods and challenges to data sharing and aggrigation

So what?

Page 54: Technologies, methods and challenges to data sharing and aggrigation

The Semantic Web

Gives us the opportunity to re-thinkhow we build our health data infrastructures

Page 55: Technologies, methods and challenges to data sharing and aggrigation

The Semantic Web

isn’t

“yet another layer of technology”

Page 56: Technologies, methods and challenges to data sharing and aggrigation

The Semantic Web

changes

the way we write software

Page 57: Technologies, methods and challenges to data sharing and aggrigation

The Semantic Web

What to do & How to do itis no longer encoded in your software

Page 58: Technologies, methods and challenges to data sharing and aggrigation

The Semantic Web

What to do & How to do itis part of the data

Page 59: Technologies, methods and challenges to data sharing and aggrigation

The Semantic Web

What to do & How to do itis part of a shared,

expert understanding

Page 60: Technologies, methods and challenges to data sharing and aggrigation

The Semantic Web

What to do & How to do itIS* PERSONAL!

* can be...

Page 61: Technologies, methods and challenges to data sharing and aggrigation

One piece of software

Any question... Any answer

Page 62: Technologies, methods and challenges to data sharing and aggrigation

Let me demonstrate what I mean

Page 63: Technologies, methods and challenges to data sharing and aggrigation

Founding partner

Semantic Automated Discovery and Integrationhttp://sadiframework.org

MicrosoftResearch

Page 64: Technologies, methods and challenges to data sharing and aggrigation

Semantic Health And Research Environment

(a Semantic Web question answerer...)

Page 65: Technologies, methods and challenges to data sharing and aggrigation

Example #1

Show me the latest Blood Urea Nitrogen and Creatinine levelsof patients who appear to be rejecting their transplants

SELECT ?patient ?bun ?creatFROM <http://sadiframework.org/ontologies/patients.rdf>WHERE {

?patient rdf:type patient:LikelyRejecter .?patient l:latestBUN ?bun . ?patient l:latestCreatinine ?creat .

}

Page 66: Technologies, methods and challenges to data sharing and aggrigation

Likely Rejecter:

A patient who has creatinine levelsthat are increasing over time

- - Wilkinson “MD”

Page 67: Technologies, methods and challenges to data sharing and aggrigation

Likely Rejecter:

…but there is no “likely rejecter” column or table in our database…

Page 68: Technologies, methods and challenges to data sharing and aggrigation

Likely Rejecter:

Our database contains various blood chemistry measurements

at various time-points

Page 69: Technologies, methods and challenges to data sharing and aggrigation

SemanticsSHARE determines

by itself

the need to do a Linear Regression analysis over

Creatinine blood chemistry measurements

Page 70: Technologies, methods and challenges to data sharing and aggrigation

SemanticsSHARE determines

by itself

how and where that analysiscan be done

and does it

Page 71: Technologies, methods and challenges to data sharing and aggrigation

The SHARE system utilizes Semantics (via SADI) to discover and access analytical services on the Web that do linear regression analysis

Page 72: Technologies, methods and challenges to data sharing and aggrigation

VOILA!

Page 73: Technologies, methods and challenges to data sharing and aggrigation

Neither SADI nor SHARE

know anything about

blood chemistry, or mathematics

Page 74: Technologies, methods and challenges to data sharing and aggrigation

Example #2

From a (contrived) integrated dataset, retrieve the blood pressure measurements

SELECT ?output ?unit ?value FROM <http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl> WHERE {

?output rdf:type sbp:BloodPressure . ?output local:hasCanonicalAttribute ?pr . ?pr sio:SIO_000221 ?unit . ?pr sio:SIO_000300 ?value .

}

Page 75: Technologies, methods and challenges to data sharing and aggrigation

This should be extremely straightforward...

Page 76: Technologies, methods and challenges to data sharing and aggrigation

...except for one problem...

Page 77: Technologies, methods and challenges to data sharing and aggrigation

<owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance1"> <rdf:type rdf:resource="&galen;SystolicBloodPressure"/> <resource:SIO_000300>0.137</resource:SIO_000300> <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/meter-of-mercury-column"/> </owl:NamedIndividual>

<owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance2"> <rdf:type rdf:resource="&galen;SystolicBloodPressure"/> <resource:SIO_000300>12.45</resource:SIO_000300> <resource:SIO_000221 rdf:resource="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#centi-meter-of-mercury-column"/> </owl:NamedIndividual>

<owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance3"> <rdf:type rdf:resource="&galen;SystolicBloodPressure"/> <resource:SIO_000300>5.3</resource:SIO_000300> <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/inch-of-mercury-column"/> </owl:NamedIndividual>

Page 78: Technologies, methods and challenges to data sharing and aggrigation

Spot the problem?

Page 79: Technologies, methods and challenges to data sharing and aggrigation

Let me help...

Page 80: Technologies, methods and challenges to data sharing and aggrigation

<owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance1"> <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>

<resource:SIO_000300>0.137</resource:SIO_000300>

<resource:SIO_000221 rdf:resource="&ucum;unit/pressure/meter-of-mercury-column"/> </owl:NamedIndividual>

<owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance2"> <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>

<resource:SIO_000300>12.45</resource:SIO_000300>

<resource:SIO_000221 rdf:resource=“&ucum;unit/framingham/sbpfeb.owl#centi-meter-of-mercury-column"/> </owl:NamedIndividual>

<owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance3"> <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>

<resource:SIO_000300>5.3</resource:SIO_000300>

<resource:SIO_000221 rdf:resource="&ucum;unit/pressure/inch-of-mercury-column"/>

</owl:NamedIndividual>

Page 81: Technologies, methods and challenges to data sharing and aggrigation

SemanticsExample #2

From a (contrived) integrated dataset, retrieve the blood pressure measurements

SELECT ?output ?unit ?value FROM <http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl> WHERE {

?output rdf:type sbp:BloodPressure . ?output local:hasCanonicalAttribute ?pr . ?pr sio:SIO_000221 ?unit . ?pr sio:SIO_000300 ?value .

}

My semantic definition of “Blood Pressure”includes the units that I want...

Page 82: Technologies, methods and challenges to data sharing and aggrigation

SemanticsThis is enough to trigger SHARE to automatically discover

an online unit-conversion service...

Page 83: Technologies, methods and challenges to data sharing and aggrigation
Page 84: Technologies, methods and challenges to data sharing and aggrigation

Neither SADI nor SHARE

know anything about

units or unit conversions

Page 85: Technologies, methods and challenges to data sharing and aggrigation

Many of the challenges to data aggregation and sharingnow have solutions that work!

Page 86: Technologies, methods and challenges to data sharing and aggrigation

What, in my opinion, is the greatest remaining challenge?

Page 87: Technologies, methods and challenges to data sharing and aggrigation

To a biologist...

...“data mining” means “this data is mine!”

Page 88: Technologies, methods and challenges to data sharing and aggrigation

The challenge to us all

Move from Data Mine-ing

To Data Ours-ing

-- Len Silverston, 2007

Page 89: Technologies, methods and challenges to data sharing and aggrigation

We’ve come a long way!!

XMLXML Schema

There will be an element called “GBQualifier”There will be a child attribute called “GBQualifier_name”The content of that child attribute will be free-textThere will be a child attribute called “GBQualifier_value”The content of that child attribute will be free-text

XML SchemaThere will be an element called “qualifier”It will have an attribute called “name”The content of that attribute will be textThere will be a child attribute called “value”The content of that child attribute will be free-text

Page 90: Technologies, methods and challenges to data sharing and aggrigation

Microsoft Research

TEAM:

Luke McCarthyBenjamin Vandervalk

Soroush Samadian