data: the good, the bad & the ugly

Data: The Good, The Bad & The Ugly. Lee Harland @SciBitely http://www.scibite.com http://www.slideshare.net/scibitely Lee Harland, Lilly Global IT Meeting, November 2016

Upload: lee-harland

Post on 10-Feb-2017

TRANSCRIPT

Page 1: Data: The Good, The Bad & The Ugly

Data: The Good, The Bad & The Ugly

Lee Harland @SciBitely

http://www.scibite.com

http://www.slideshare.net/scibitely

Lee Harland, Lilly Global IT Meeting, November 2016

Page 2: Data: The Good, The Bad & The Ugly

Context

• This is an invited talk I gave at Lilly’s internal Global IT meeting on the subject of “data”

Page 3: Data: The Good, The Bad & The Ugly

The Good

Page 4: Data: The Good, The Bad & The Ugly

http://www.nejm.org/doi/full/10.1056/NEJMp1606181

Page 5: Data: The Good, The Bad & The Ugly
Page 6: Data: The Good, The Bad & The Ugly
Page 7: Data: The Good, The Bad & The Ugly
Page 8: Data: The Good, The Bad & The Ugly

What matters to me!

Page 9: Data: The Good, The Bad & The Ugly

The Bad

Page 10: Data: The Good, The Bad & The Ugly

+ =

… (Promotion of) the nutritional importance of spinach over other foods led to an increase of over 30 per cent in its consumption during the 1920s and 30s.

The action of S. oleracea on cardiovascular output and muscular tone

Page 11: Data: The Good, The Bad & The Ugly

Bad, Bad Data Point

1870: 35.2 mg Fe/100 g
1937: 3.52 mg Fe/100 g

The mythical strength-giving properties of spinach are ... credited to a simple mistake concerning the iron content of the vegetable.

In 1870, Dr E von Wolf published figures which were accepted until the 1930s, when they were rechecked.

This revealed that a decimal point had been placed wrongly and that the real figure was only one tenth of Dr von Wolf's claim.

Page 12: Data: The Good, The Bad & The Ugly

Still Making Headlines After 140 Years

2013

Page 13: Data: The Good, The Bad & The Ugly

There Is No Decimal Point Error

Page 14: Data: The Good, The Bad & The Ugly

X X

Page 15: Data: The Good, The Bad & The Ugly

X

Page 16: Data: The Good, The Bad & The Ugly

Spinach: One Small Data Point, One Huge Mess

1870: 35.2 mg Fe/100 g
1937: 3.52 mg Fe/100 g

✓✓

Both Values Are Correct – The difference is down to the assay conditions

Page 17: Data: The Good, The Bad & The Ugly

http://www.merriam-webster.com/dictionary/provenance

Page 18: Data: The Good, The Bad & The Ugly

35.2

35.2

The datapoint + its provenance (experimental context)

What people saw
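
The point of this slide can be sketched in a few lines of Python: a number only becomes a usable datapoint once its experimental context travels with it. This is a minimal illustration, not anything shown in the talk. The class names, the fields and the `comparable` rule are assumptions, and the assay-condition labels are placeholders because the slides do not say what the conditions were.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Experimental context for a measurement (fields are illustrative)."""
    source: str            # who reported the value
    year: int              # when it was reported
    assay_conditions: str  # how the sample was prepared and measured


@dataclass(frozen=True)
class DataPoint:
    value: float
    unit: str
    provenance: Provenance


def comparable(a: DataPoint, b: DataPoint) -> bool:
    """Treat two values as directly comparable only if unit and assay conditions match."""
    return (a.unit == b.unit
            and a.provenance.assay_conditions == b.provenance.assay_conditions)


# Values and years come from the slides; the condition labels are placeholders.
iron_1870 = DataPoint(35.2, "mg Fe/100g",
                      Provenance("von Wolf", 1870, "condition A (placeholder)"))
iron_1937 = DataPoint(3.52, "mg Fe/100g",
                      Provenance("1930s recheck", 1937, "condition B (placeholder)"))

print(comparable(iron_1870, iron_1937))  # False
```

Stripped of its provenance, 35.2 looks like a contradiction of 3.52; with it, the two values are simply answers to different questions.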

Page 19: Data: The Good, The Bad & The Ugly

So What?

Page 20: Data: The Good, The Bad & The Ugly

… estimates for the irreproducibility of preclinical research range from 51 percent to 89 percent. They estimate that at least half of all U.S. preclinical biomedical research funding, about $28 billion annually, is therefore squandered …

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002165

Page 21: Data: The Good, The Bad & The Ugly

http://www.merriam-webster.com/dictionary/provenance

Page 22: Data: The Good, The Bad & The Ugly

Provenance Is A Critical Component of Reproducibility

What L cells, where from, how old, epigenetic profile etc etc?

When, how often, in what way, using what system?????

What, when, how?

Could you accurately reproduce this experiment from this method?

* I was responsible for this paragraph

Page 23: Data: The Good, The Bad & The Ugly

http://www.nature.com/nrd/journal/v10/n9/full/nrd3439-c1.html

A first-of-a-kind analysis of Bayer's internal efforts to validate 'new drug target' claims now not only supports this view but suggests that 50% may be an underestimate; the company's in-house experimental data do not match literature claims in 65% of target-validation projects, leading to project discontinuation.

Page 24: Data: The Good, The Bad & The Ugly

This is where Informatics & Data Science can add real value to Drug Discovery

Page 25: Data: The Good, The Bad & The Ugly

Open PHACTS https://www.openphacts.org/

Page 26: Data: The Good, The Bad & The Ugly

Open PHACTS: Adding Provenance To Data

http://nanopub.org/

Page 27: Data: The Good, The Bad & The Ugly

sub:Head {
  this: np:hasAssertion sub:assertion ;
    np:hasProvenance sub:provenance ;
    np:hasPublicationInfo sub:pubinfo ;
    a np:Nanopublication .
}

sub:assertion {
  nx:NX_P35712 bfo:BFO_0000066 ts:TS-0276 ; # Protein NX_P35712 is localized in tissue TS-0276
    ro:has_quality "positive" .
}

sub:provenance {
  <http://www.nextprot.org/help/quality_criteria/silver> a eco:ECO_0000205 ;
    rdfs:label "neXtProt silver"^^xsd:string .
  sub:_1 a efo:EFO_00027688 .
  sub:_10 a eco:ECO_0000218 .
  sub:_2 a eco:ECO_0000218 .
  sub:_9 a efo:EFO_00027688 .
  sub:assertion prv:usedData <http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000087&amp;organ_id=EV:0100115&amp;gene_id=ENSG00000110693> ,
      <http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000088&amp;organ_id=EV:0100115&amp;gene_id=ENSG00000110693> ,
      <http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000090&amp;organ_id=EV:0100115&amp;gene_id=ENSG00000110693&amp;stage_children=on> ,
      <http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000092&amp;organ_id=EV:0100115&amp;gene_id=ENSG00000110693&amp;stage_children=on> ,
      <http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000094&amp;organ_id=EV:0100115&amp;gene_id=ENSG00000110693&amp;stage_children=on> ;
    wi:evidence <http://www.nextprot.org/help/quality_criteria/silver> ;
    a eco:ECO_0000220 ;
    rdfs:comment " data, NX_P35712 is expressed in Endometrium"^^xsd:string ;
    prov:wasDerivedFrom sub:_1 , sub:_3 , sub:_5 , sub:_7 , sub:_9 ;
    prov:wasGeneratedBy sub:_10 , sub:_2 , sub:_4 , sub:_6 , sub:_8 .
}

sub:pubinfo {
  sub:_11 a eco:ECO_0000205 .
  sub:_12 a eco:ECO_0000205 .
  sub:_15 a eco:ECO_0000205 .
  this: dcterms:created "2014-09-19T00:00:00.0Z"^^xsd:dateTime ;
    dcterms:rights <http://creativecommons.org/licenses/by/3.0/> ;
    dcterms:rightsHolder <http://nextprot.org> ;
    prv:usedData "neXtProt database" ;
    pav:authoredBy "CALIPHO project" , <http://orcid.org/0000-0001-6710-1373> , <http://orcid.org/0000-0001-6818-334X> , <http://orcid.org/0000-0002-1303-2189> , <http://orcid.org/0000-0003-1813-6857> ;
    pav:versionNumber "3" ;
    prov:wasGeneratedBy sub:_11 , sub:_12 , sub:_13 , sub:_14 , sub:_15 .
}

http://nanopub.org
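
The structure of the nanopublication on this slide (a head graph that links an assertion, its provenance and its publication info) can be checked with a rough sketch like the one below. It is a regex-based structural check under stated assumptions, not a TriG parser and not Open PHACTS code. The function names are hypothetical, and the sample text is a heavily abbreviated version of the slide's example.

```python
import re

# Abbreviated from the slide: one triple per graph, prefixes omitted.
TRIG = """
sub:Head {
  this: np:hasAssertion sub:assertion ;
    np:hasProvenance sub:provenance ;
    np:hasPublicationInfo sub:pubinfo ;
    a np:Nanopublication .
}
sub:assertion { nx:NX_P35712 bfo:BFO_0000066 ts:TS-0276 . }
sub:provenance { sub:assertion wi:evidence <http://www.nextprot.org/help/quality_criteria/silver> . }
sub:pubinfo { this: pav:versionNumber "3" . }
"""


def nanopub_parts(trig: str) -> dict:
    """Map each np:has* link found in the text to the graph name it points at."""
    links = re.findall(
        r"np:(hasAssertion|hasProvenance|hasPublicationInfo)\s+(\S+)", trig)
    return dict(links)


def missing_graphs(trig: str) -> list:
    """Return linked graph names that never appear as a `name { ... }` block."""
    defined = set(re.findall(r"^\s*(\S+)\s*\{", trig, flags=re.MULTILINE))
    return [g for g in nanopub_parts(trig).values() if g not in defined]


print(nanopub_parts(TRIG))
print(missing_graphs(TRIG))  # [] when all three linked graphs are present
```

For real data a proper RDF library would be the sensible choice; the sketch only shows that the assertion never travels without a pointer to where it came from.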

Page 28: Data: The Good, The Bad & The Ugly

https://explorer.openphacts.org

Page 29: Data: The Good, The Bad & The Ugly

One of the few user interfaces where provenance is intrinsically “there”

Page 30: Data: The Good, The Bad & The Ugly

The Ugly

Page 31: Data: The Good, The Bad & The Ugly

80-90% of all potentially usable business information may originate in unstructured form

https://en.wikipedia.org/wiki/Unstructured_data

The Ugly

Page 32: Data: The Good, The Bad & The Ugly

“Carboxypeptidase B2” “Thrombin-Activatable Fibrinolysis Inhibitor”

“Plasma CPU”

The True Picture (they are the same thing)

Page 33: Data: The Good, The Bad & The Ugly

It hasn’t just got 3 names, it’s got LOTS

carboxypeptidase B-like protein OR thrombin-activatable fibrinolysis inhibitor OR CPB type 2 OR Carboxypeptidase type B2 OR plasma carboxypeptidase type B OR carboxypeptidase type B2 OR CPB2 OR Plasma carboxypeptidase type B OR CPB-2 OR carboxypeptidase B2 (plasma) OR carboxypeptidase U OR Carboxypeptidase type U OR carboxypeptidase type U OR plasma carboxypeptidase B2 OR carboxy-peptidylase U OR thrombin-activable fibrinolysis inhibitor OR plasma carboxypeptidase type B2 OR carboxypeptidase B2 (plasma) OR CPU OR carboxypeptidase B2 OR PCPB OR pCPB OR Carboxypeptidase U OR plasma carboxypeptidase B OR TAFI OR Carboxypeptidase B2 OR Plasma carboxypeptidase B OR Thrombin-activable fibrinolysis inhibitor OR carboxypeptidase B2 plasma OR carboxypeptidase R
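
A minimal sketch of why this matters for search: if every synonym resolves to one key, a query for any name finds them all. This is a toy dictionary lookup, not how TERMite or any real ontology tool works. The names `SYNONYMS`, `normalise` and `canonical` are hypothetical, only a handful of the synonyms above are included, and "CPB2" is used as the canonical label purely for illustration.

```python
import re

# A few of the synonyms from the slide, all mapped to one illustrative key.
SYNONYMS = {
    "CPB2": [
        "carboxypeptidase B2",
        "thrombin-activatable fibrinolysis inhibitor",
        "TAFI",
        "plasma carboxypeptidase B",
        "carboxypeptidase U",
        "CPU",
        "carboxypeptidase R",
    ],
}


def normalise(name: str) -> str:
    """Lower-case and collapse hyphens/whitespace so trivial variants match."""
    return re.sub(r"[\s\-]+", " ", name.strip().lower())


# Reverse index: normalised synonym -> canonical key.
LOOKUP = {
    normalise(s): key
    for key, names in SYNONYMS.items()
    for s in names + [key]
}


def canonical(name: str):
    """Return the canonical key for a known synonym, else None."""
    return LOOKUP.get(normalise(name))


print(canonical("Thrombin-Activatable Fibrinolysis Inhibitor"))  # CPB2
print(canonical("Plasma CPU"))  # None: not in this small table
```

Note the trap in the table itself: folding case makes "CPU" match every mention of a central processing unit, which is exactly why plain keyword matching is not enough and context has to be taken into account.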

Page 34: Data: The Good, The Bad & The Ugly

“We also manually standardized data related to lab measurement units and terminology related to patient race and ethnicity, geographical study regions, and names of drugs and drug families.”

Yet Another Issue

Page 35: Data: The Good, The Bad & The Ugly

(an accident waiting to happen)

Page 36: Data: The Good, The Bad & The Ugly

Databases: Where Knowledge Goes To Die

VARCHAR2

PROJ_TITLE
EXPERIMENT_INFO
ASSAY_DESCRIPTION
KEYWORDS
USER_PROFILE
SUMMARY
EXPT_METADATA
SETTINGS_INFO
REPORT_TEXT
EXPT_NAME
MEETING_MINUTES
PROJ_ACTIONS
ASSAY_CONCLUSION
COHORT_DESC
INCLUSION_CRITERIA
POLICY_DETAILS
PROJECT_OVERVIEW
RATIONALE
JUSTIFICATION

Page 37: Data: The Good, The Bad & The Ugly

Text2Data MicroService

TEXT: Supports basic keyword search only

TERMite

DATA: Rich substrate for search and discovery & insight

Page 38: Data: The Good, The Bad & The Ugly

Just What Is “The Data”?

• Mentions of all

• Genes, Diseases, Drugs, Tissues, Cells, Techniques, Assays, Measures, Protocols, Compounds, Regimens, Companies, People, Locations, Pathologies, Adverse Events, Pathways, Metabolism, Manufacturing Concepts, QC/QA, Pathogens, Strains, Animals … and so on...

• … And their relationships to each other
• … And their locations (section, database column)
• … Inferring relationships between documents/entries
• … Regardless of actual keyword used

Page 39: Data: The Good, The Bad & The Ugly

Systems Integration Guide

http://yourcompany.com/termite?text=<content>&app=<application name>&index=<e.g. page, table or column name>
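
As a rough sketch of what calling such an endpoint might look like from another system, the snippet below builds the request URL with Python's standard library. The host is the slide's placeholder and the parameter names `text`, `app` and `index` are taken from the slide. Everything else (the function name, the sample values, and the assumption that the parameters form an ordinary query string) is hypothetical and should not be read as the documented TERMite API.

```python
from urllib.parse import urlencode

# Placeholder endpoint copied from the slide; not a real service address.
BASE_URL = "http://yourcompany.com/termite"


def build_request_url(text: str, app: str, index: str) -> str:
    """Encode the three parameters named on the slide into a query string."""
    query = urlencode({"text": text, "app": app, "index": index})
    return f"{BASE_URL}?{query}"


url = build_request_url(
    text="TAFI inhibitor screen",
    app="ELN",                  # hypothetical application name
    index="ASSAY_DESCRIPTION",  # hypothetical column name
)
print(url)
# http://yourcompany.com/termite?text=TAFI+inhibitor+screen&app=ELN&index=ASSAY_DESCRIPTION
```

Using `urlencode` rather than string concatenation matters here because free text from a database column will routinely contain spaces, ampersands and other characters that would otherwise break the query.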

Page 40: Data: The Good, The Bad & The Ugly

ELN Screening Registry

PDM Registry

Project Management Sharepoint

What’s going on, right now

Page 41: Data: The Good, The Bad & The Ugly

Trending Today

Page 42: Data: The Good, The Bad & The Ugly
Page 43: Data: The Good, The Bad & The Ugly
Page 44: Data: The Good, The Bad & The Ugly

Why Give Ugly Data A Makeover?

• ELN annotation using Bioassay Ontology
    • “Find all experiments using any Cell Fluorescence technique”
• Pharmacovigilance
    • Monitoring newsfeeds & internal data for safety signals
• Automatic Process Notification
    • Alert groups based on content of CRO documents, etc.
• Synergise Both Semantic Technology & Information Professionals
    • Re-energise Therapeutic Area Literature Searching
• Build Knowledge Chains (Assertional Provenance)
    • Project Management → ELN Data → Screen SOP

Page 45: Data: The Good, The Bad & The Ugly

Before I go…..

Page 46: Data: The Good, The Bad & The Ugly

Spinach: The Truth Is Out There!

Spinach is high in iron (!)

..oxalic acid in spinach prevents more than 90% of iron from being absorbed..

Acknowledgement

Page 47: Data: The Good, The Bad & The Ugly

Acknowledgements

IMI Open PHACTS Team (many more involved, I just don’t have a photo :( )
http://openphacts.org

SciBite Team
http://scibite.com