Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies Mark Wilkinson World Research & Innovation Congress, Brussels, 2013 Isaac Peral Senior Researcher in Biological Informatics Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain Adjunct Professor of Medical Genetics, University of British Columbia Vancouver, BC, Canada.


Page 1: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Improving Transparency and Reproducibility

of Biomedical Research Using Semantic Technologies

Mark Wilkinson

World Research & Innovation Congress, Brussels, 2013

Isaac Peral Senior Researcher in Biological Informatics, Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain

Adjunct Professor of Medical Genetics, University of British Columbia, Vancouver, BC, Canada.

Page 2: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Making the Web a biomedical research platform

from hypothesis through to publication

Page 3: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Publication

Discourse

Hypothesis

Experiment

Interpretation

Page 4: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Motivation:

3 intersecting trends in the Life Sciences

that are now, or soon will be, extremely problematic

Page 5: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

NON-REPRODUCIBLE SCIENCE & THE FAILURE OF PEER REVIEW

TREND #1

Page 6: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Trend #1

Multiple recent surveys of high-throughput biology

reveal that upwards of 50% of published studies

are not reproducible

- Baggerly, 2009
- Ioannidis, 2009

Page 7: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Similar (if not worse!) in clinical studies

- Begley & Ellis, Nature, 2012
- Booth, Forbes, 2012

- Huang & Gottardo, Briefings in Bioinformatics, 2012

Trend #1

Page 8: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Trend #1

“the most common errors are simple, the most simple errors are common”

At least partially because the analytical methodology was inappropriate

and/or not sufficiently described

- Baggerly, 2009

Page 9: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Trend #1

These errors pass peer review

The researcher is (sometimes) unaware of the error

The process that led to the error is not recorded

Therefore it cannot be detected during peer-review

Page 10: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Agencies have Noticed!

In March 2012, the US Institute of Medicine said, in effect:

“Enough is enough!”

Page 11: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Agencies have Noticed!

Institute of Medicine Recommendations for Conduct of High-Throughput Research:

Evolution of Translational Omics: Lessons Learned and the Path Forward. The Institute of Medicine of the National Academies, Report Brief, March 2012.

1. Rigorously-described, -annotated, and -followed data management and manipulation procedures

2. “Lock down” the computational analysis pipeline once it has been selected

3. Publish the analytical workflow in a formal manner, together with the full starting and result datasets

Page 12: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

BIGGER, CHEAPER DATA

TREND #2

Page 13: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Trend #2

High-throughput technologies are becoming cheaper and easier to use

Page 14: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Trend #2

High-throughput technologies are becoming cheaper and easier to use

But there are still very few experts trained in statistical analysis of high-throughput data

Page 15: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Trend #2

Therefore

Even small, moderately-funded laboratories can now afford to produce more data

than they can manage or interpret

Page 16: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

“THE SINGULARITY”

TREND #3

Page 17: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al, The Fourth Paradigm: Data-Intensive Scientific Discovery Tony Hey (Editor), 2009

Slide adapted with permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston, IL, USA, June 22, 2012.

Page 18: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al, The Fourth Paradigm: Data-Intensive Scientific Discovery Tony Hey (Editor), 2009

Slide borrowed with permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston, IL, USA, June 22, 2012.

“The Singularity”

The X-intercept is where, the moment a discovery is made, it is immediately put into practice

Page 19: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Scientific research would have to be conducted within a medium that

immediately interpreted and disseminated the results...

You Are

Here

Page 20: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

...in a form that immediately (actively!) affected the results of other researchers...

You Are

Here

Page 21: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

...without requiring them to be aware of these new discoveries.

You Are

Here

Page 22: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

3 intersecting and problematic trends

Non-reproducible science that passes peer-review

Cheaper production of larger and more complex datasets that require specialized expertise to analyze properly

Need to more rapidly disseminate and use new discoveries

Page 23: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

We Want More!

Page 24: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

I don’t just want to reproduce your experiment...

Page 25: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

I want to re-use your experiment

Page 26: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

In my own laboratory... On MY DATA!

Page 27: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

When I do my analysis

I want to draw on the knowledge

of global domain-experts like

statisticians and pathologists...

...as if they were mentors sitting

in the chair beside me.

Page 28: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Image from: Mark Smiciklas Intersection Consulting, cc-nca

Please don’t make me find

all of the data and knowledge

that I require to do my experiment

...it simply isn’t possible anymore...

Page 29: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Image from AJ Cann, cc-by-a license

I want to support peer review(ers) so that I do better science.

Page 30: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

How do we get there from here?

Page 31: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

To overcome these intersecting problems

and to achieve the goals of transparent, reproducible research

Page 32: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

We must learn how to do research IN the Web

Not OVER the Web

Page 33: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

How we use The Web today

Page 34: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

The Web is not a pigeon!

Page 35: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Semantic Web Technologies

Page 36: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Design Pattern for Publishing Analytical Tools on the Semantic Web

Page 37: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Application that uses SADI to interpret globally-distributed

expert knowledge

in order to discover and execute the right tool, at the right time, for the right analysis

Page 38: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Reproduce a peer-reviewed scientific publication

by semantically modelling the problem

CHALLENGE:

Page 39: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

The Publication

Discovering Protein Partners of a Human Tumor Suppressor Protein

Page 40: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Original Study Simplified

Using what is known about protein interactions

in fly & yeast

predict new interactions with this Human Tumor Suppressor
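The idea on this slide (transferring known interactions from model organisms to human through orthology) can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the identifiers are made-up placeholders, the `predict_interactors` function is hypothetical, and the logic is a generic "interolog" style inference, not the method of the original publication or of SHARE.

```python
# Toy sketch of interaction transfer by orthology ("interologs").
# All identifiers are made-up placeholders, not real database records.

def predict_interactors(protein, orthologs, interactions):
    """Return human proteins predicted to interact with `protein`.

    orthologs:    dict mapping a human protein to a set of its
                  model-organism orthologs
    interactions: set of frozenset({a, b}) pairs known in the
                  model organisms
    """
    # Reverse map: model-organism protein -> human proteins it maps back to
    to_human = {}
    for human, models in orthologs.items():
        for model in models:
            to_human.setdefault(model, set()).add(human)

    predicted = set()
    for model_protein in orthologs.get(protein, set()):
        for pair in interactions:
            if model_protein in pair:
                # The other member of the pair is the known partner
                for partner in pair - {model_protein}:
                    predicted |= to_human.get(partner, set())
    predicted.discard(protein)  # do not report the query protein itself
    return predicted


orthologs = {
    "human:P1": {"yeast:Y1", "fly:F1"},
    "human:P2": {"yeast:Y2"},
    "human:P3": {"fly:F3"},
    "human:P4": {"yeast:Y4"},
}
interactions = {
    frozenset({"yeast:Y1", "yeast:Y2"}),
    frozenset({"fly:F1", "fly:F3"}),
    frozenset({"yeast:Y4", "yeast:Y2"}),
}

result = predict_interactors("human:P1", orthologs, interactions)
print(sorted(result))
```

In this toy data the query protein has one yeast and one fly ortholog, each with one known partner, so two human candidates come back. A real analysis would also need to decide how orthology and interaction evidence are scored, which this sketch ignores.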

Page 41: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Semantic Model of the Experiment

OWL

Web Ontology Language (OWL) is the language approved by the W3C

for representing knowledge in the Web

Page 42: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Note that every word in this diagram is, in reality, a URL (it’s a Semantic Web model)

i.e. It refers to the expertise of other researchers, distributed around the world on the Web (i.e. NanoPubs***)

*** remember this word!! It will be important later!!

Semantic Model of the Experiment

Page 43: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

In a local data-file

provide the protein we are interested in

and the two species we wish to use in our comparison

taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly

Set-up the Experimental Conditions

Page 44: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {
  ?protein a i:ProbableInteractor .
}

Run the Experiment

Page 45: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {
  ?protein a i:ProbableInteractor .
}

Run the Experiment

This (i:ProbableInteractor) is the URL that leads our computer to the Semantic model of the problem

Page 46: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

SHARE examines the semantic model of Probable Interactors

Retrieves third-party expertise from the Web

Discusses with SADI what analytical tools are necessary

Chooses the right tools for the problem

Solves the problem!

Page 47: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

SHARE derives (and executes) the following analysis automatically

Page 48: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

SHARE is aware of the context of the specific question being asked

Page 49: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies
Page 50: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

There are four very cool things about what you just saw...

Page 51: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

There are four very cool things about what you just saw...

SHARE was able to create a workflow based on a

semantic model

Page 52: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

There are four very cool things about what you just saw...

SHARE was able to create a COMPUTATIONAL workflow

based on a BIOLOGICAL model

Page 53: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

There are four very cool things about what you just saw...

(this is important because we want this system to be used by clinicians and biologists

who don’t speak computerese!)

Page 54: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

There are four very cool things about what you just saw...

The workflow it created, and services chosen, differed depending on the context of the

specific question being asked

taxon:4932 a i:ModelOrganism1 . # yeast

taxon:7227 a i:ModelOrganism2 . # fly

Page 55: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

There are four very cool things about what you just saw...

The choice of tool-selection was

guided by the knowledge of worldwide domain-experts encoded in

globally-distributed ontologies

(e.g. Expert high-throughput statisticians, etc...)

Page 56: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

We have not over-trivialized the problem of interpreting clinical data...

Page 57: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Measurement Units

One example of the “little ways” that Semantics will help clinical researchers

day-by-day

Page 58: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Units must be harmonized

Don’t leave this up to the researcher (it’s fiddly, time-consuming, and error-prone)

Page 59: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

NASA Mars Climate Orbiter

Page 60: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Oops!

Page 61: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

ID   HEIGHT  WEIGHT  SBP   CHOL  HDL  BMI GR  SBP GR  CHOL GR  HDL GR
pt1  1.82    177     128   227   55   0       0       1        0
pt2  179     196     13.4  5.9   1.7  1       0       1        0

The Chaos of Real-world Clinical Datasets (this is a snapshot of an actual dataset we worked on)

Height in m and cm; Chol in mmol/l and mg/dl

...and other delicious weirdness

Page 62: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

GOAL: get the clinical researcher “out of the loop” once the data is collected

(as per the Institute of Medicine Recommendations)

Page 63: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Semantically defining clinical phenotypes; Building on the expertise of others

SystolicBloodPressure =

GALEN:SystolicBloodPressure and
("sio:has measurement value" some
  ("sio:measurement"
    and ("sio:has unit" some "om:unit of measure")
    and ("om:dimension" value "om:pressure or stress dimension")
    and ("sio:has value" some rdfs:Literal)))

Very general definition: “some kind of pressure unit”

(so that others can build on this as they wish!)

Page 64: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

HighRiskSystolicBloodPressure (as defined by Framingham)

SystolicBloodPressure and sio:hasMeasurement some
  (sio:Measurement
    and ("sio:has unit" value om:kilopascal)
    and (sio:hasValue some double[>= "18.7"^^double]))

Now we are specific to our clinical study: MUST be in kilopascal and must be >= 18.7

Semantically defining clinical phenotypes; Building on the expertise of others

Page 65: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

SELECT ?record ?convertedvalue ?convertedunit
FROM <./patient.rdf>
WHERE {
  ?record rdf:type measure:HighRiskSystolicBloodPressure .
  ?record sio:hasMeasurement ?measurement .
  ?measurement sio:hasValue ?convertedvalue .
  # assumed pattern: binds the unit variable named in the SELECT clause
  ?measurement sio:hasUnit ?convertedunit .
}

RecordID  Start Val  Start Unit  Pressure  End Unit
Pt1       15         cmHg        19.998    KiloPascal
Pt2       14.6       cmHg        19.465    KiloPascal
Pt1       148        mmHg        19.731    KiloPascal
Pt2       146        mmHg        19.465    KiloPascal

Running the Clinical Analysis

All measurements have now been automatically harmonized to KiloPascal, because we encoded the semantics in the model
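The arithmetic behind this harmonization can be checked with a short Python sketch. It is only an illustration of the unit conversion and the 18.7 kPa threshold, assuming the standard conversion 1 mmHg ≈ 0.133322 kPa; the names `KPA_PER_UNIT`, `to_kilopascal` and `is_high_risk_sbp` are hypothetical. In the system described in the talk the conversion is driven by the semantic model, not by hand-written code like this.

```python
# Conversion factors to kilopascal (1 mmHg is approximately 0.133322 kPa).
KPA_PER_UNIT = {
    "mmHg": 0.133322387415,
    "cmHg": 1.33322387415,
    "kPa": 1.0,
}

# Threshold taken from the HighRiskSystolicBloodPressure definition.
HIGH_RISK_SBP_KPA = 18.7


def to_kilopascal(value, unit):
    """Convert a pressure reading to kilopascal, rejecting unknown units."""
    try:
        return value * KPA_PER_UNIT[unit]
    except KeyError:
        raise ValueError(f"unknown pressure unit: {unit!r}") from None


def is_high_risk_sbp(value, unit):
    """Apply the >= 18.7 kPa criterion after harmonizing the unit."""
    return to_kilopascal(value, unit) >= HIGH_RISK_SBP_KPA


# The four readings shown in the table.
records = [
    ("Pt1", 15, "cmHg"),
    ("Pt2", 14.6, "cmHg"),
    ("Pt1", 148, "mmHg"),
    ("Pt2", 146, "mmHg"),
]
for record_id, value, unit in records:
    print(record_id, value, unit, f"{to_kilopascal(value, unit):.3f}", "kPa")
```

The converted values agree with the Pressure column to within rounding in the third decimal place. Raising an error on an unknown unit is a deliberate choice here: silently passing an unconverted number through is exactly the kind of mistake the talk warns about.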

Page 66: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Visual inspection of our output data and the AHA guidelines

showed that in many cases the clinician

“tweaked” the guidelines when doing their own analysis

AHA BMI risk threshold: BMI = 25
In our dataset the clinical researcher used BMI = 26

AHA HDL guideline: HDL <= 1.03 mmol/l
The dataset from our researcher: HDL <= 0.89 mmol/l

Page 67: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Visual inspection of our output data and the AHA guidelines

showed that in many cases the clinician

“tweaked” the guidelines when doing their own analysis

These Alterations Were Not Recorded in Their Study Notes!

Page 68: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Adjusting our Semantic definitions and re-running the analysis resulted in nearly 100% correspondence with the clinical researcher

HighRiskCholesterolRecord=

PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.0]))))

HighRiskCholesterolRecord=

PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.2]))))

Page 69: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Reflect on this for a second... Because this is important!

1. We semantically encoded clinical guidelines

2. We found that clinical researchers did not follow the official guidelines

3. Their “personalization” of the guidelines was unreported

4. Nevertheless, we were able to create “personalized” Semantic Models

5. These reflect the opinion of an individual domain-expert

6. These models are shared on the Web

7. Can be automatically re-used by others to interpret their own data using

that clinical expert’s viewpoint

Page 70: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

AHA:HighRiskCholesterolRecord

PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.0]))))

McManus:HighRiskCholesterolRecord

PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.2]))))

PREFIX AHA: <http://americanheart.org/measurements/>

PREFIX McManus: <http://stpaulshospital.org/researchers/mcmanus/>

Page 71: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

To do the analysis using AHA guidelines

SELECT ?patient ?risk

WHERE {

?patient rdf:type AHA:HighRiskCholesterolRecord .

?patient ex:hasCholesterolProfile ?risk

}

Page 72: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

To do the analysis using McManus’ expert-opinion

SELECT ?patient ?risk

WHERE {

?patient rdf:type McManus:HighRiskCholesterolRecord .

?patient ex:hasCholesterolProfile ?risk

}
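The effect of swapping one namespace for the other in the two queries above can be mimicked in a few lines of Python. This is a sketch under stated assumptions: the two thresholds (5.0 and 5.2 mmol/l) come from the definitions shown earlier, while the dictionary layout, the `high_risk_patients` function and the patient values are made up for illustration and are not part of the talk.

```python
# Thresholds in mmol/l from the two HighRiskCholesterolRecord definitions.
# The dictionary layout and function name are illustrative assumptions.
HIGH_RISK_CHOLESTEROL_MMOL_L = {
    "AHA": 5.0,
    "McManus": 5.2,
}


def high_risk_patients(cholesterol_by_patient, definition):
    """Return the patients that the named definition classifies as high risk."""
    threshold = HIGH_RISK_CHOLESTEROL_MMOL_L[definition]
    return sorted(
        patient
        for patient, value in cholesterol_by_patient.items()
        if value >= threshold
    )


# Made-up serum cholesterol values (mmol/l), for illustration only.
patients = {"ptA": 4.8, "ptB": 5.1, "ptC": 5.9}

aha = high_risk_patients(patients, "AHA")
mcmanus = high_risk_patients(patients, "McManus")
print("AHA:", aha)
print("McManus:", mcmanus)
print("Classified differently:", sorted(set(aha) - set(mcmanus)))
```

Because each definition is addressed by name, the same data can be interpreted under either viewpoint and the patients on whom the two disagree can be listed explicitly, which is the comparability the talk is arguing for.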

Page 73: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Flexibility Transparency

Reproducibility Shareability Comparability

Simplicity Automation

Page 74: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Two final points....

Page 75: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Publication

Discourse

Hypothesis

Experiment

Interpretation

??

Page 76: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

The Semantic Model represents a possible solution to a problem

By my definition, that is a hypothesis

Page 77: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

The Semantic Model represents a possible solution to a problem

That hypothesis is tested by automatically converting it into a workflow; the results of the workflow are intimately tied to the hypothesis

Page 78: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

The Semantic Model represents a possible solution to a problem

i.e. You (or anyone!) can determine exactly which aspect of the hypothesis led to which output data element, why, and how
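One way to picture this kind of traceability is to have every output element carry a pointer to the definition and criterion that produced it, together with the input as originally recorded. The Python sketch below is only an analogy under stated assumptions: the `Finding` record, the `classify` function and the `http://example.org/...` URI are hypothetical and do not describe how SHARE actually records provenance.

```python
from dataclasses import dataclass
from typing import Optional

KPA_PER_MMHG = 0.133322387415  # standard conversion factor


@dataclass(frozen=True)
class Finding:
    record_id: str
    conclusion: str
    definition: str  # which definition led to the conclusion
    criterion: str   # the clause of the definition that matched
    evidence: str    # the input as recorded, plus the harmonized value


def classify(record_id, sbp_mmhg, definition_uri, threshold_kpa) -> Optional[Finding]:
    """Return a Finding with its provenance, or None if the criterion fails."""
    kpa = sbp_mmhg * KPA_PER_MMHG
    if kpa < threshold_kpa:
        return None
    return Finding(
        record_id=record_id,
        conclusion="high-risk systolic blood pressure",
        definition=definition_uri,
        criterion=f"value >= {threshold_kpa} kPa",
        evidence=f"{sbp_mmhg} mmHg = {kpa:.3f} kPa",
    )


# Hypothetical URI standing in for a shared, Web-published definition.
DEFINITION = "http://example.org/defs#HighRiskSystolicBloodPressure"

finding = classify("Pt1", 148, DEFINITION, 18.7)
print(finding)
```

A reviewer reading such a record can see which definition was applied, which clause matched, and what the raw measurement was, without asking the original researcher.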

Page 79: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

The Semantic Model represents a possible solution to a problem

“Exquisite Provenance”

Page 80: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

And this is important because...

Page 81: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Note that every word in this diagram is, in reality, a URL (it’s a Semantic Web model)

i.e. It refers to the expertise of other researchers, distributed around the world on the Web (i.e. NanoPubs***)

*** remember this word!! It will be important later!!

Remember when I said this...?

Page 82: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

“Exquisite Provenance”

is required

for the output data and knowledge to be published as...

Page 83: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Semantic Web-based, richly annotated, citable, and queryable snippets of scientific knowledge

(that can be used to construct novel SHARE hypotheses!)

Page 84: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Life Web Science:

The Semantic Web is a cradle-to-grave biomedical research platform

that can, and will, dramatically improve how biomedical research is done

We Are

Here!

Page 85: Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies

Microsoft Research