BMI Research in Progress - Thursday talk

Query Federation over the Life Sciences Linked Open Data Cloud. Maulik R. Kamdar, Musen Lab. BMIR Research in Progress Talk, November 10, 2016

Upload: maulik-kamdar

Post on 17-Jan-2017

TRANSCRIPT

Page 1: BMI Research in Progress - Thursday talk

Query Federation over the Life Sciences Linked Open Data Cloud

Maulik R. Kamdar, Musen Lab

BMIR Research in Progress Talk

November 10, 2016

Page 2: BMI Research in Progress - Thursday talk
Page 3: BMI Research in Progress - Thursday talk

The Data and Knowledge Discovery Bottleneck

Biomedical Queries

List drugs that have Mol. Wt < 1000 and inhibit proteins involved in signal transduction. Mention their half-life.

List antineoplastic agents that target EGFR or PDGFR, with literature citations and their downstream targets.

Desirable Drugs, Molecular characteristics, Protein Targets, Downstream Genes …

Biomedical Informatics Research Methods

Open PHACTS. Williams, et al. Drug Discovery Today, 2012

Page 4: BMI Research in Progress - Thursday talk

Systems Pharmacology

Zhao, Shan, and Ravi Iyengar. Annual Review of Pharmacology and Toxicology 52 (2012): 505.

Page 5: BMI Research in Progress - Thursday talk

Isolated Databases and Knowledgebases

DISTRIBUTED DATA and KNOWLEDGE

•  Formats (XML, CSV, MySQL Database, etc.)
•  Entity Notations (Ensembl, Entrez, HGNC, etc.)
•  Schemas (Small Compound, Compound, etc.)

Page 6: BMI Research in Progress - Thursday talk

Semantic Web Technologies

Berners-Lee, Scientific American 2001

Tim Berners-Lee: The next Web of open, linked data (TED Talk 2009)

Page 7: BMI Research in Progress - Thursday talk

Semantic Web: Publishing Data as a Graph

Gleevec (Mol. Wt.: 589.25, Half-Life: 18 hours) inhibits PDGFR, involved in signal transduction.

[Figure: the statement above drawn as a Resource Description Framework (RDF) graph that spans DrugBank and the Kyoto Encyclopedia of Genes and Genomes (KEGG). Nodes: Gleevec (DrugBank: DB00619), Gleevec (KEGG: D01441), 589.25, “18 hours”, Inhibits, PDGFR, GO:0007165 (Signal Transduction). Edge labels: mol_weight, half-life, x-ref, target, name, type, process.]

Uniform Resource Identifier

http://bio2rdf.org/kegg:D01441

http://bio2rdf.org/drugbank:DB00619

Page 8: BMI Research in Progress - Thursday talk

Semantic Web: Querying the Graph

Gleevec (Mol. Wt.: 589.25, Half-Life: 18 hours) inhibits PDGFR, involved in signal transduction.

List drugs that have Mol. Wt < 1000 and inhibit proteins involved in signal transduction. Mention their half-life.

[Figure: two graphs side by side. Resource Description Framework (RDF): the data graph for Gleevec (DrugBank: DB00619; KEGG: D01441) with edge labels mol_weight (589.25), half-life (“18 hours”), x-ref, target, name (PDGFR), type (Inhibits) and process (GO:0007165, Signal Transduction). SPARQL Query Language: the same graph shape written as a query pattern, where the drug nodes, the target name and the half-life become variables (?, ?target, ?half-life), mol_weight is constrained to < 1000, and the type Inhibits and the process GO:0007165 (Signal Transduction) stay fixed.]
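The idea of matching a query pattern against a data graph can be sketched in plain Python. This is only an illustration under stated assumptions: the triples, the blank-node name `_:t1`, and the way the edges connect the nodes are read off the slide's labels rather than taken from the real datasets, and a real system would send SPARQL to an endpoint instead of scanning a list.

```python
# Toy triple store: (subject, predicate, object). All identifiers and the
# edge layout are illustrative assumptions based on the slide's labels.
TRIPLES = [
    ("drugbank:DB00619", "half-life", "18 hours"),
    ("drugbank:DB00619", "x-ref", "kegg:D01441"),
    ("kegg:D01441", "mol_weight", 589.25),
    ("kegg:D01441", "target", "_:t1"),
    ("_:t1", "type", "Inhibits"),
    ("_:t1", "name", "PDGFR"),
    ("PDGFR", "process", "GO:0007165"),
]


def match(pattern, binding):
    """Yield every extension of `binding` that makes `pattern` match a triple."""
    for triple in TRIPLES:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if isinstance(term, str) and term.startswith("?"):
                # Variable: bind it, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                # Constant: must equal the triple's term exactly.
                ok = False
                break
        if ok:
            yield new


def query(patterns):
    """Join the triple patterns left to right, like a basic graph pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b for old in bindings for b in match(pattern, old)]
    return bindings


PATTERNS = [
    ("?drug", "half-life", "?hl"),
    ("?drug", "x-ref", "?kegg"),
    ("?kegg", "mol_weight", "?mw"),
    ("?kegg", "target", "?t"),
    ("?t", "type", "Inhibits"),
    ("?t", "name", "?protein"),
    ("?protein", "process", "GO:0007165"),
]

# The numeric constraint plays the role of a SPARQL FILTER.
results = [(b["?drug"], b["?hl"]) for b in query(PATTERNS) if b["?mw"] < 1000]
print(results)
```

Each variable (a string starting with `?`) corresponds to one of the question marks in the query graph on the slide.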

Page 9: BMI Research in Progress - Thursday talk

Life Sciences Linked Open Data Cloud

Cyganiak, Richard et al. 2014

Page 10: BMI Research in Progress - Thursday talk

Life Sciences Linked Open Data Cloud (LSLOD)

Callahan, A., et al., 2013.

Saleem, M., Kamdar, MR. et al., 2014.

Jupp, S., et al., 2014.

Noy, NF., et al., 2009.

Page 11: BMI Research in Progress - Thursday talk

What this talk is about…

LSLOD Query Federation
•  Challenges mining the LSLOD cloud
•  Current methods
•  Rule-based method (RIP)

LSLOD Applications
•  Biomedical Question-Answering
•  Systems Pharmacology (RIP)

Page 12: BMI Research in Progress - Thursday talk


Page 13: BMI Research in Progress - Thursday talk

Challenges mining the LSLOD cloud

•  Isolated SPARQL endpoints or RDF Dumps
•  Different URI notations, with no explicit x-ref links
    •  http://bio2rdf.org/uniprot:P45059
    •  http://purl.uniprot.org/uniprot/P45059
•  Heterogeneity between the Semantic Web datasets
•  Technical Issues: Malformed URIs, unavailable SPARQL endpoints, etc.
    •  http://bio2rdf.org/kegg:map00010, http://bio2rdf.org/kegg:00010
    •  http://bio2rdf.org/go:0030307\”

Page 14: BMI Research in Progress - Thursday talk

Heterogeneity between Semantic Web Datasets

Model Mismatch: Different graph patterns to capture granularity

DrugBank: Gleevec --drug-target--> PDGFR
KEGG: Gleevec --target--> _:blank (type: Inhibits; name: PDGFR; source: PubMed:21152856)

Label Mismatch: Different labels for classes, relations and attributes

DrugBank: Gleevec --molecular_weight--> 493.61
KEGG: Gleevec --mol_weight--> 589.25

Page 15: BMI Research in Progress - Thursday talk

Data Warehousing: Transforming data under one uniform schema and uniform notations

WAREHOUSING

[Figure: Data Graphs. Open PHACTS. Williams, 2012]

✓  Efficient query execution
✓  Complete results
✗  Data copies
✗  Inflexible, not scalable

Page 16: BMI Research in Progress - Thursday talk

Query Federation: Executing different portions of queries across different sources

QUERY FEDERATION

List drugs that have Mol. Wt < 1000 and inhibit proteins involved in signal transduction. Mention their half-life.

Query: Drug
•  molecular-weight < 1000
•  target
•  process = “GO:0007165”
•  half-life

DrugBank: Drug
•  molecular-weight < 1000
•  target
•  half-life

KEGG: Drug
•  molecular-weight < 1000
•  target
•  process = “GO:0007165”

Schwarte, et al. ISWC 2012

Page 17: BMI Research in Progress - Thursday talk

Reconciliation during Federation

DrugBank: Gleevec --molecular_weight--> 493.61
KEGG: Gleevec --mol_weight--> 589.25

Information on the same entity in two different sources may be different.
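A reconciliation step first has to notice that two sources disagree. The sketch below is a hypothetical check, not the method used in the talk: the function name `conflicts` and the 1% relative tolerance are assumptions, and only the two molecular weights come from the slide.

```python
def conflicts(records, tolerance=0.01):
    """Return the entities whose per-source values differ by more than
    `tolerance` (relative to the largest value). Assumes positive numbers."""
    flagged = {}
    for entity, by_source in records.items():
        values = list(by_source.values())
        lowest, highest = min(values), max(values)
        if highest - lowest > tolerance * highest:
            flagged[entity] = by_source
    return flagged


# Molecular weights for Gleevec as shown on the slide.
records = {"Gleevec": {"DrugBank": 493.61, "KEGG": 589.25}}
flagged = conflicts(records)
print(flagged)
```

Flagged entities would then need a policy (prefer one source, keep both values, or ask a curator), which the slide leaves open.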

Page 18: BMI Research in Progress - Thursday talk

Reconciliation during Federation

Drug hasTarget Protein

Unique relations for a given relation type, in each source. Similar relations in multiple sources, that need to be reconciled.

Page 19: BMI Research in Progress - Thursday talk

Query Federation: The SPARQL SERVICE keyword

PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX kegg: <http://bio2rdf.org/ns/kegg#>
PREFIX purl: <http://purl.org/dc/elements/1.1/>
PREFIX bio2RDF: <http://bio2rdf.org/ns/bio2rdf#>
SELECT DISTINCT ?drug ?halflife WHERE {
  # Portion of the query sent to the remote DrugBank endpoint
  SERVICE <http://www4.wiwiss.fu-berlin.de/drugbank/sparql> {
    ?drug a drugbank:Drug .
    ?drug drugbank:molecular-weight ?drugbankMolwt .
    ?drug drugbank:half-life ?halflife .
    ?drug bio2RDF:inchi ?drugbankInchi .
    ?drug drugbank:target ?drugbankProtein .
    ?drug purl:title ?drugBankName .
  }
  # Portion answered from the KEGG data
  ?keggDrug bio2RDF:drug-target ?keggProtein .
  ?keggDrug bio2RDF:inchi ?keggInchi .
  ………………………………………………
  # Join the two sources on the InChI string, then apply the weight limit
  FILTER (?drugbankInchi = ?keggInchi)
  FILTER (?drugbankMolwt < 1000)
}

Saleem, et al. Journal of Web Semantics, 2016.

Also, SPARQL is just messy.

Page 20: BMI Research in Progress - Thursday talk

Rewriting during Query Federation

Page 21: BMI Research in Progress - Thursday talk

Query Federation: The SPARQL ASK method

Schwarte, et al. ISWC 2012

?s a <Drug>
?s <hasMolWt> ?mw
?s <hasTarget> ?protein
?s <hasHalfLife> ?hl
?mw < 1000
?protein <hasGO> <GO:0007165>

ASK { ?s <hasMolWt> ?mw }
ASK { ?s <hasTarget> ?protein }
…
ASK { ?protein <hasGO> <GO:0007165> }
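ASK-based source selection can be sketched as follows: before running the query, each triple pattern is probed against each endpoint, and only the endpoints that answer "true" are kept for that pattern. The code is a simulation under assumptions: `ENDPOINTS` is a hypothetical in-memory stand-in for two SPARQL endpoints, and which predicates each one holds is invented for illustration.

```python
# Hypothetical stand-ins for remote endpoints: the set of predicates for
# which an ASK { ?s <p> ?o } probe would return true. Contents are assumed.
ENDPOINTS = {
    "DrugBank": {"rdf:type", "hasMolWt", "hasTarget", "hasHalfLife"},
    "KEGG": {"rdf:type", "hasMolWt", "hasTarget", "hasGO"},
}


def ask(endpoint, pattern):
    """Simulate ASK { pattern }: true if the endpoint could match it."""
    _subject, predicate, _object = pattern
    return predicate in ENDPOINTS[endpoint]


def select_sources(patterns):
    """Map every triple pattern to the endpoints that answered true."""
    return {p: [e for e in sorted(ENDPOINTS) if ask(e, p)] for p in patterns}


patterns = [
    ("?s", "hasMolWt", "?mw"),
    ("?s", "hasTarget", "?protein"),
    ("?s", "hasHalfLife", "?hl"),
    ("?protein", "hasGO", "GO:0007165"),
]
plan = select_sources(patterns)
for pattern, sources in plan.items():
    print(pattern, "->", sources)
```

In a live setting each `ask` call is a network round trip, so the number of probes grows with patterns times endpoints.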

Page 22: BMI Research in Progress - Thursday talk

Vocabulary of Interlinked Datasets

Linked Data on the Web Workshop (LDOW09), in conjunction with WWW09

Page 23: BMI Research in Progress - Thursday talk

Query Federation: The VoID method

Gorlitz, et al. ISWC 2014

?s a <Drug>
?s <hasMolWt> ?mw
?s <hasTarget> ?protein
?s <hasHalfLife> ?hl
?mw < 1000
?protein <hasGO> <GO:0007165>

Triple patterns annotated with the selected source:

?s a <Drug>    (DrugBank)
?s <hasMolWt> ?mw    (DrugBank)
?s <hasTarget> ?protein    (KEGG)
?s <hasTarget> ?protein    (DrugBank)
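With VoID descriptions, source selection needs no probes at query time: an index from predicate to dataset is built once from the published statistics. The sketch below assumes a simplified summary (predicate mapped to a triple count, loosely modeled on VoID property partitions); the counts are made-up placeholders, and ordering sources by partition size is just one plausible use of such statistics.

```python
from collections import defaultdict

# Hypothetical VoID-style summaries: predicate -> number of triples.
# The numbers are placeholders, not real dataset statistics.
VOID = {
    "DrugBank": {"hasMolWt": 7000, "hasTarget": 15000, "hasHalfLife": 1500},
    "KEGG": {"hasMolWt": 10000, "hasTarget": 12000, "hasGO": 30000},
}


def build_index(void):
    """Invert the per-dataset summaries into predicate -> [(dataset, count)]."""
    index = defaultdict(list)
    for dataset, partitions in void.items():
        for predicate, triples in partitions.items():
            index[predicate].append((dataset, triples))
    return index


def sources_for(index, predicate):
    """Datasets that declare the predicate, smallest partition first."""
    entries = sorted(index.get(predicate, []), key=lambda entry: entry[1])
    return [dataset for dataset, _count in entries]


index = build_index(VOID)
print(sources_for(index, "hasTarget"))
print(sources_for(index, "hasGO"))
```

The trade-off against ASK probing is freshness: the index is only as accurate as the descriptions the data providers publish.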

Page 24: BMI Research in Progress - Thursday talk

Heterogeneity between Semantic Web Datasets

Label Mismatch: Different labels for classes, relations and attributes

DrugBank: Gleevec --molecular_weight--> 493.61
KEGG: Gleevec --mol_weight--> 589.25

Page 25: BMI Research in Progress - Thursday talk

•  Query federation offers a scalable, flexible approach towards data integration without the flaws of data warehousing

•  However, the current methods using only SERVICE, ASK or VoID descriptions are not suitable over LSLOD

•  Combining query federation with schema mapping …

Page 26: BMI Research in Progress - Thursday talk

Mapping source schemas to an ontology

Callahan, et al. Journal of Biomedical Semantics 2013

Comparative Toxicogenomics Database

Semanticscience Integrated Ontology

Saccharomyces Genome Database

Page 27: BMI Research in Progress - Thursday talk

Cancer Chemoprevention Ontology in OWL Lite

Zeginis, D. et al. Semantic Web Journal, 2013

http://bioportal.bioontology.org/ontologies/CANCO

Page 28: BMI Research in Progress - Thursday talk

Query Federation: Using Rule Templates

Hasnain, A., Kamdar MR, et al. ISWC 2014

CANCO

Canco:Drug Canco:molecular_weight

1) Drugbank:Drug Drugbank:molecular_weight
2) KEGG:Drug KEGG:mol_wt

Page 29: BMI Research in Progress - Thursday talk

Linking methods to generate Rule Templates

•  Naïve matching:
    –  CANCO:Drug ← Drugbank:Drug
•  Named entity matching:
    –  CANCO:Molecule ← ChEBI:Compound
•  Domain matching:
    –  CANCO:Molecule ← CTD:Chemical
        •  If {InchiKey(CTD:Chemical uri)} ≈ {InchiKey(ChEBI:Compound uri)}
•  RegEx/ID matching:
    –  Uniprot:Protein ← Bio2RDF:Uniprot_Resource
        •  If {Regex(Uniprot:Protein uri)} = {Regex(Bio2RDF:Uniprot_Resource uri)}
        •  http://bio2rdf.org/uniprot:P45059, http://purl.uniprot.org/uniprot/P45059
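The RegEx/ID matching rule above can be sketched as: extract the local identifier from each URI with a regular expression and treat two URIs as the same entity when the identifiers are equal. The two URI shapes below are assumptions generalized from the slide's two example URIs, and the helper names are hypothetical.

```python
import re

# Assumed URI shapes, generalized from the two example URIs on the slide.
URI_PATTERNS = [
    re.compile(r"^http://bio2rdf\.org/uniprot:(?P<id>[A-Z0-9]+)$"),
    re.compile(r"^http://purl\.uniprot\.org/uniprot/(?P<id>[A-Z0-9]+)$"),
]


def uniprot_id(uri):
    """Return the accession embedded in the URI, or None if no shape fits."""
    for pattern in URI_PATTERNS:
        found = pattern.match(uri)
        if found:
            return found.group("id")
    return None


def same_entity(uri_a, uri_b):
    """Two URIs denote the same protein if their accessions are equal."""
    id_a, id_b = uniprot_id(uri_a), uniprot_id(uri_b)
    return id_a is not None and id_a == id_b


bio2rdf = "http://bio2rdf.org/uniprot:P45059"
purl = "http://purl.uniprot.org/uniprot/P45059"
print(uniprot_id(bio2rdf), same_entity(bio2rdf, purl))
```

Anchoring the patterns with `^` and `$` also rejects malformed URIs, such as ones with trailing characters, instead of silently matching a prefix.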

Page 30: BMI Research in Progress - Thursday talk

Partially solves the Label Mismatch problem but…

Model Mismatch: Different graph patterns to capture granularity

DrugBank: Gleevec --drug-target--> PDGFR
KEGG: Gleevec --target--> _:blank (type: Inhibits; name: PDGFR; source: PubMed:21152856)

Mapping Rules:

Drug hasTarget ?Protein ← DrugBank:Drug DrugBank:drug-target ?Protein
Drug hasTarget ?Protein ← KEGG:Drug KEGG:target ?Protein

Query Results: PDGFR

Page 31: BMI Research in Progress - Thursday talk

Using Graph Patterns for Query Rewriting

List drugs that have Mol. Wt < 1000 and inhibit proteins involved in signal transduction. Mention their half-life.

Mapping Rules:

?Drug hasTarget ?Protein ← ?Drug DrugBank:drug-target ?Protein
?Drug hasTarget ?Protein ← ?Drug KEGG:target ?blank KEGG:link ?Protein

Query:

?s a <Drug>
?s <hasMolWt> ?mw
?s <hasTarget> ?protein
?s <hasHalfLife> ?hl
?mw < 1000
?protein <hasGO> <GO:0007165>

Query Rewriting (DrugBank):

?s a <Drug>
{ ?s <molecular_weight> ?mw_blank
  ?mw_blank <value> ?mw }
?s <drug-target> ?protein
{ ?s <half-life> ?hl_blank
  ?hl_blank <value> ?hl }
?mw < 1000

Query Rewriting (KEGG):

?s a <Drug>
?s <mol_wt> ?mw
{ ?s <target> ?protein_blank
  ?protein_blank <link> ?protein }
?protein <hasGO> <GO:0007165>
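The rewriting step can be sketched as a lookup from a domain-model predicate to a per-source path of predicates, where every intermediate hop introduces a fresh variable. This is a minimal illustration under assumptions: the `RULES` table encodes only the two hasTarget rules from the slide, and the function names and the `_blank0` naming scheme are hypothetical.

```python
# Mapping rules: domain predicate -> source -> path of source predicates.
# Only the two hasTarget rules from the slide are encoded here.
RULES = {
    "hasTarget": {
        "DrugBank": ["drug-target"],
        "KEGG": ["target", "link"],
    },
}


def rewrite(pattern, source):
    """Expand one (s, p, o) pattern into the source's chain of patterns."""
    subject, predicate, obj = pattern
    path = RULES[predicate][source]
    rewritten, current = [], subject
    for i, source_predicate in enumerate(path):
        is_last = i == len(path) - 1
        # Intermediate hops get a fresh variable derived from the object.
        target = obj if is_last else f"{obj}_blank{i}"
        rewritten.append((current, source_predicate, target))
        current = target
    return rewritten


def to_sparql(patterns):
    """Render triple patterns as a SPARQL basic graph pattern fragment."""
    return " . ".join(f"{s} <{p}> {o}" for s, p, o in patterns)


domain_pattern = ("?s", "hasTarget", "?protein")
print(to_sparql(rewrite(domain_pattern, "DrugBank")))
print(to_sparql(rewrite(domain_pattern, "KEGG")))
```

Keeping the rules as data rather than code means new sources can be added by extending the table, which fits the slide's point that such rules may be built manually or learned.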

Page 32: BMI Research in Progress - Thursday talk

Include more complex patterns…

[Figure: the domain concept Small Molecule mapped to two source patterns. CTD:Chemical --mol_weight--> < 900. CHEBI:Compound --molecularWeight--> node with value < 900 and unit Da.]

Page 33: BMI Research in Progress - Thursday talk

•  Query federation, combined with schema mapping methods, may provide an alternative approach for the LSLOD cloud

•  Query multiple sources without being concerned about the underlying heterogeneity.

•  Pattern-based mapping rules can aid in generating complex constructs a priori, as well as aid in entity reconciliation.

•  Manual construction of such Pattern-based mapping rules -- using a visual interface and recommendation

•  How to automate …

Page 34: BMI Research in Progress - Thursday talk

Future Work: How to learn these mapping rules…

[Figure: candidate graph patterns from DrugBank and KEGG (edge labels: target, drug-target, name, type, Inhibits, calculated) are compared against the Domain Model relation Drug --target--> Protein, with unresolved candidates marked “?”.]

Pattern ranking

Features: Number of Nodes, Number of Edges, No. of Blank nodes, Distribution of nodes, Start node, Sim(E_pattern, E_model), ...

Logistic Regression

Page 35: BMI Research in Progress - Thursday talk

Ongoing Work: Evaluation of the method

•  Comparative evaluation:
    –  FedX,
    –  SPLENDID and
    –  SPLENDID augmented with the query rewriting component
•  Metrics:
    –  Query complexity
    –  Source selection time
    –  Query execution time, etc.
•  Benchmarks:
    –  LargeRDFBench (Billion Triples Benchmark consisting of Linked TCGA, DrugBank, KEGG, Affymetrix)
•  Involve domain users …

Page 36: BMI Research in Progress - Thursday talk

What this talk is about…

LSLOD Query Federation
•  Challenges mining the LSLOD cloud
•  Current methods
•  Rule-based method (RIP)

LSLOD Applications
•  Biomedical Question-Answering
•  Systems Pharmacology (RIP)

Page 37: BMI Research in Progress - Thursday talk

Application: Systems Pharmacology

Page 38: BMI Research in Progress - Thursday talk

Systems Pharmacology

Zhao, Shan, and Ravi Iyengar. Annual Review of Pharmacology and Toxicology 52 (2012): 505.

Page 39: BMI Research in Progress - Thursday talk

Underlying mechanisms for drug-drug interactions.

Jia et al. Nature Reviews Drug Discovery, 8(2):111–128, 2009

Page 40: BMI Research in Progress - Thursday talk

Data Model

Concept
E1    Drug
E2    Protein
E3    Pathway
E4    Adverse Drug Reaction

Relation
R1    Drug hasTarget Protein
R2    Drug hasEnzyme Protein
R3    Drug hasTransporter Protein
R4    Protein isPresentIn Pathway
R5    Pathway isImplicatedIn ADR

Page 41: BMI Research in Progress - Thursday talk

Manually generated rules

Source                        Graph Pattern (R1)
Drugbank                      E1 <--drug-- Target-Relation --target--> E2
PharmGKB                      E1 <--drug-- gene-drug-Association --gene--> E2
KEGG                          E1 --target--> _:blank --link--> E2
Comparative Toxicogenomics    E1 <--chemical-- Chemical-Gene-Association --gene--> E2

Page 42: BMI Research in Progress - Thursday talk

PhLeGrA – Linked Graph Analytics in Pharmacology

Page 43: BMI Research in Progress - Thursday talk

Graph Analytics Module

•  Association: {Drug}n --> ADR
•  2-state Hidden Conditional Random Field model over the k-partite network, with (k-2) hidden layers.
    –  Discriminative probabilistic graphical model
    –  Unobserved entities on the association path
    –  No assumption on independence of inputs
    –  Quattoni, Ariadna, et al. IEEE Trans. Pattern Anal. Mach. Intell. 2007.
•  Inputs–Outcomes database to train the HCRF.
    –  US FDA Adverse Event Reporting System (FAERS)
    –  Drugs, Adverse Reactions, Indications, Doses etc.
    –  Text pre-processing using UMLS terminologies
    –  X = Drugs, Y = ADRs, H = {Proteins, Pathways}

Page 44: BMI Research in Progress - Thursday talk

Network statistics

Generated in < 1 day

Page 45: BMI Research in Progress - Thursday talk

Semantic Web and Systems Pharmacology

R1: Drug hasTarget Protein

E1: Drug

•  Similar and completely unique entities and relations exist between data sources
•  Necessary to get the complete picture, but also determine sources of noise
•  “Will the correct drugs please stand up?” - Southan et al. GCC 2016

Page 46: BMI Research in Progress - Thursday talk

AUROC curves for some ADRs

Page 47: BMI Research in Progress - Thursday talk

Web application to explore underlying mechanisms

http://onto-apps.stanford.edu/phlegra

Page 48: BMI Research in Progress - Thursday talk

Future Work…

•  How to test the significance of the associations discovered?

•  Implement other models (e.g. simple path enrichment)
    –  Deal with reporting bias.

•  Use existing “Silver standards” such as OMOP, EU-ADR, Drugbank DDIs, MedSpan, etc.

•  Relevance of the associations, as well as the correctness of the underlying mechanisms.

•  Importance of the sources to create such networks.

Page 49: BMI Research in Progress - Thursday talk

Application: Biomedical Question-Answering

Page 50: BMI Research in Progress - Thursday talk

ReVeaLD: Real-time Visual Explorer and Aggregator of Linked Data

Kamdar, Maulik R., et al. Journal of Biomedical Informatics 2014.

Page 51: BMI Research in Progress - Thursday talk

To summarize…

•  Exciting opportunities to seamlessly query and integrate data and knowledge from isolated sources.

•  Query federation can aid biomedical question answering and systems pharmacology

•  The data collected, or the networks generated, can be used in downstream analytics approaches (e.g. Protein/Drug structures in AutoDock Vina)

•  Make Semantic Web more usable for biomedical researchers!

Page 52: BMI Research in Progress - Thursday talk

Acknowledgments

Musen Lab
-  Mark Musen
-  Tania Tudorache
-  Csongor Nyulas
-  Matthew Horridge
-  Rafael Gonçalves
-  Josef Hardi
-  Marcos Martinez
-  Martin O’Connor
-  John Graybeal
-  Alex Screnchuk
And others…

Michel Dumontier, Russ Altman, Rainer Winnenberg, Juan Banda, Amrapali Zaveri, BMI Students

Saleem M., Ali Hasnain, Axel Ngonga, Helena Deus, Jonas Almeida, Stefan Decker

“Discovery informatics is in its infancy. Search engines are grappling with the need for deep search, but it is doubtful they will fulfill the needs of the biomedical research community when it comes to finding and analyzing the appropriate datasets…”

- Philip Bourne, Associate Director for Data Science, NIH, 2014

Page 53: BMI Research in Progress - Thursday talk

Ebola-KB Dashboard

Kamdar, MR., et al. Database 2015.

Page 54: BMI Research in Progress - Thursday talk

Linked TCGA

Saleem, M., Kamdar MR, et al. Web Semantics, 2014.

Page 55: BMI Research in Progress - Thursday talk

Pathway-based drug repositioning

Li, et al. BMC Bioinformatics 2013