bmi research in progress - thursday talk
TRANSCRIPT
Query Federation over the Life Sciences Linked Open Data Cloud
Maulik R. Kamdar, Musen Lab
BMIR Research in Progress Talk
November 10, 2016
The Data and Knowledge Discovery Bottleneck
Biomedical Queries
List drugs that have Mol. Wt. < 1000 and inhibit proteins involved in signal transduction. Mention their half-life.
List antineoplastic agents that target EGFR or PDGFR, with literature citations and their downstream targets.
Desirable Drugs, Molecular characteristics, Protein Targets, Downstream Genes…
Biomedical Informatics Research Methods
Open PHACTS. Williams, et al. Drug Discovery Today, 2012
Systems Pharmacology
Zhao, Shan, and Ravi Iyengar. Annual Review of Pharmacology and Toxicology 52 (2012): 505.
Isolated Databases and Knowledgebases
DISTRIBUTED DATA and KNOWLEDGE
• Formats (XML, CSV, MySQL Database, etc.)
• Entity Notations (Ensembl, Entrez, HGNC, etc.)
• Schemas (Small Compound, Compound, etc.)
Semantic Web Technologies
Berners-Lee, Scientific American 2001
Tim Berners-Lee: The next Web of open, linked data (TED Talk 2009)
Semantic Web: Publishing Data as a Graph
Gleevec (Mol. Wt.: 589.25, Half-Life: 18 hours) inhibits PDGFR, involved in signal transduction.
Resource Description Framework (RDF)
[Figure: RDF graph linking a DrugBank node and a KEGG node for Gleevec. Edge labels: mol_weight (589.25), half-life ("18 hours"), x-ref, Inhibits, target, name (PDGFR), type, process (GO:0007165, Signal Transduction).]
Uniform Resource Identifier
DrugBank: Gleevec, DB00619, http://bio2rdf.org/drugbank:DB00619
Kyoto Encyclopedia of Genes and Genomes (KEGG): Gleevec, D01441, http://bio2rdf.org/kegg:D01441
Semantic Web: Querying the Graph
List drugs that have Mol. Wt. < 1000 and inhibit proteins involved in signal transduction. Mention their half-life.
Resource Description Framework (RDF), SPARQL Query Language
Gleevec (Mol. Wt.: 589.25, Half-Life: 18 hours) inhibits PDGFR, involved in signal transduction.
[Figure: the Gleevec RDF graph (DrugBank DB00619, KEGG D01441; mol_weight 589.25; half-life "18 hours"; x-ref; Inhibits, target, name PDGFR, type, process GO:0007165, Signal Transduction) shown beside a query graph pattern with the same edges, in which the drug nodes, ?half-life and ?target are variables and mol_weight is constrained to < 1000.]
Life Sciences Linked Open Data Cloud
Cyganiak, Richard et al. 2014
Callahan, A., et al., 2013.
Saleem, M., Kamdar, MR. et al., 2014.
Jupp, S., et al., 2014.
Noy, NF., et al., 2009.
Life Sciences Linked Open Data Cloud (LSLOD)
What this talk is about…
LSLOD Query Federation
• Challenges mining the LSLOD cloud
• Current methods
• Rule-based method (RIP)
LSLOD Applications
• Biomedical Question-Answering
• Systems Pharmacology (RIP)
Challenges mining the LSLOD cloud
• Isolated SPARQL endpoints or RDF dumps
• Different URI notations, with no explicit x-ref links
  • http://bio2rdf.org/uniprot:P45059
  • http://purl.uniprot.org/uniprot/P45059
• Heterogeneity between the Semantic Web datasets
• Technical issues: malformed URIs, unavailable SPARQL endpoints, etc.
  • http://bio2rdf.org/kegg:map00010, http://bio2rdf.org/kegg:00010
  • http://bio2rdf.org/go:0030307\”
Heterogeneity between Semantic Web Datasets
Model Mismatch: Different graph patterns to capture granularity
[Figure: DrugBank links Gleevec directly to PDGFR with a drug-target edge; KEGG links Gleevec through an intermediate Inhibits node with target, name, type and source (PubMed:21152856) edges.]
Label Mismatch: Different labels for classes, relations and attributes
[Figure: Gleevec molecular_weight 493.61 (DrugBank) versus Gleevec mol_weight 589.25 (KEGG).]
Data Warehousing: Transforming data under one uniform schema and uniform notations
WAREHOUSING
Open PHACTS. Williams, 2012. Data Graphs
✓ Efficient query execution
✓ Complete results
✗ Data copies
✗ Inflexible, not scalable
Query Federation: Executing different portions of queries across different sources
QUERY FEDERATION
Schwarte, et al. ISWC 2012
List drugs that have Mol. Wt. < 1000 and inhibit proteins involved in signal transduction. Mention their half-life.
[Figure: the full query (Drug, molecular-weight < 1000, target, process = "GO:0007165", half-life) is split into a DrugBank sub-query (Drug, molecular-weight < 1000, target, half-life) and a KEGG sub-query (Drug, molecular-weight < 1000, target, process = "GO:0007165").]
Reconciliation during Federation
[Figure: Gleevec molecular_weight 493.61 (DrugBank) versus Gleevec mol_weight 589.25 (KEGG).]
Information on the same entity in two different sources may be different.
Reconciliation during Federation
Drug hasTarget Protein
Unique relations for a given relation type, in each source. Similar relations in multiple sources, that need to be reconciled.
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX kegg: <http://bio2rdf.org/ns/kegg#>
PREFIX purl: <http://purl.org/dc/elements/1.1/>
PREFIX bio2RDF: <http://bio2rdf.org/ns/bio2rdf#>
SELECT DISTINCT ?drug ?halflife WHERE {
  SERVICE <http://www4.wiwiss.fu-berlin.de/drugbank/sparql> {
    ?drug a drugbank:Drug .
    ?drug drugbank:molecular-weight ?drugbankMolwt .
    ?drug drugbank:half-life ?halflife .
    ?drug bio2RDF:inchi ?drugbankInchi .
    ?drug drugbank:target ?drugbankProtein .
    ?drug purl:title ?drugBankName .
  }
  ?keggDrug bio2RDF:drug-target ?keggProtein .
  ?keggDrug bio2RDF:inchi ?keggInchi .
  ………………………………………………
  FILTER(?drugbankInchi = ?keggInchi)
  FILTER(?drugbankMolwt < 1000)
}
Query Federation: The SPARQL SERVICE keyword
Saleem, et al. Journal of Web Semantics, 2016.
Also, SPARQL is just messy.
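The SERVICE keyword delegates a group of triple patterns to a named remote endpoint. A minimal Python sketch of how such a query string could be assembled is given here; the helper name `build_federated_query` and the KEGG endpoint URL are illustrative assumptions, not part of the talk, and PREFIX declarations are omitted for brevity.

```python
from typing import Dict, List


def build_federated_query(select_vars: List[str],
                          patterns_by_endpoint: Dict[str, List[str]],
                          filters: List[str]) -> str:
    """Wrap each endpoint's triple patterns in its own SERVICE block."""
    lines = ["SELECT DISTINCT " + " ".join(select_vars) + " WHERE {"]
    for endpoint, patterns in patterns_by_endpoint.items():
        lines.append(f"  SERVICE <{endpoint}> {{")
        lines.extend(f"    {p} ." for p in patterns)
        lines.append("  }")
    # Filters that join values across sources go outside the SERVICE blocks.
    lines.extend(f"  FILTER({f})" for f in filters)
    lines.append("}")
    return "\n".join(lines)


query = build_federated_query(
    ["?drug", "?halflife"],
    {
        "http://www4.wiwiss.fu-berlin.de/drugbank/sparql": [
            "?drug drugbank:molecular-weight ?drugbankMolwt",
            "?drug drugbank:half-life ?halflife",
            "?drug bio2RDF:inchi ?drugbankInchi",
        ],
        # Hypothetical endpoint URL, used only for illustration.
        "http://example.org/kegg/sparql": [
            "?keggDrug bio2RDF:inchi ?keggInchi",
        ],
    },
    ["?drugbankInchi = ?keggInchi", "?drugbankMolwt < 1000"],
)
print(query)
```

Even in this small form the query author must already know which endpoint holds which predicate, which is the burden the source-selection methods try to remove.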
Rewriting during Query Federation
Query Federation: The SPARQL ASK method
Schwarte, et al. ISWC 2012
Query:
  ?s a <Drug>
  ?s <hasMolWt> ?mw
  ?s <hasTarget> ?protein
  ?s <hasHalfLife> ?hl
  ?mw < 1000
  ?protein <hasGO> <GO:0007165>
ASK probes:
  ASK { ?s <hasMolWt> ?mw }
  ASK { ?s <hasTarget> ?protein }
  …
  ASK { ?protein <hasGO> <GO:0007165> }
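ASK-based source selection can be sketched in a few lines of Python. This is a simulation under stated assumptions: the `ENDPOINTS` dictionary stands in for remote SPARQL endpoints, and the predicates assigned to each source are illustrative. A real system would send each ASK query over HTTP.

```python
from typing import Dict, List, Set

# Hypothetical stand-ins for remote endpoints: each "endpoint" is just the
# set of predicates it holds. Contents are assumed for illustration.
ENDPOINTS: Dict[str, Set[str]] = {
    "DrugBank": {"hasMolWt", "hasTarget", "hasHalfLife"},
    "KEGG": {"hasTarget", "hasGO"},
}


def ask(endpoint: str, predicate: str) -> bool:
    """Simulate ASK { ?s <predicate> ?o } against one endpoint."""
    return predicate in ENDPOINTS[endpoint]


def select_sources(predicates: List[str]) -> Dict[str, List[str]]:
    """For each query predicate, keep the endpoints whose ASK probe is true."""
    return {p: [e for e in ENDPOINTS if ask(e, p)] for p in predicates}


plan = select_sources(["hasMolWt", "hasTarget", "hasHalfLife", "hasGO"])
print(plan)
```

Note that the probe only matches when the source uses exactly the predicate named in the query, so differently labelled but equivalent relations are missed.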
Vocabulary of Interlinked Datasets
Linked Data on the Web Workshop (LDOW09), in conjunction with WWW09
Query Federation: The VoID method
Gorlitz, et al. ISWC 2014
Query:
  ?s a <Drug>
  ?s <hasMolWt> ?mw
  ?s <hasTarget> ?protein
  ?s <hasHalfLife> ?hl
  ?mw < 1000
  ?protein <hasGO> <GO:0007165>
Sources selected per triple pattern:
  ?s a <Drug>: DrugBank
  ?s <hasMolWt> ?mw: DrugBank
  ?s <hasTarget> ?protein: KEGG
  ?s <hasTarget> ?protein: DrugBank
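In contrast to live ASK probes, VoID-style selection consults precomputed dataset descriptions. A minimal sketch, assuming a hypothetical in-memory catalog `VOID_STATS` whose triple counts are invented purely for illustration:

```python
from typing import Dict, List, Tuple

# Hypothetical VoID-style statistics: dataset -> predicate -> triple count.
# The numbers are made up for this example.
VOID_STATS: Dict[str, Dict[str, int]] = {
    "DrugBank": {"hasMolWt": 5000, "hasTarget": 12000, "hasHalfLife": 3000},
    "KEGG": {"hasTarget": 8000, "hasGO": 20000},
}


def sources_for(predicate: str) -> List[Tuple[str, int]]:
    """Return (dataset, triple count) pairs that list the predicate,
    smallest count first, without contacting any endpoint."""
    hits = [(name, stats[predicate])
            for name, stats in VOID_STATS.items() if predicate in stats]
    return sorted(hits, key=lambda pair: pair[1])


print(sources_for("hasTarget"))
```

The counts let a planner estimate result sizes before execution, but the description can go stale if the dataset changes after the statistics were published.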
Heterogeneity between Semantic Web Datasets
Label Mismatch: Different labels for classes, relations and attributes
[Figure: Gleevec molecular_weight 493.61 (DrugBank) versus Gleevec mol_weight 589.25 (KEGG).]
• Query federation offers a scalable, flexible approach towards data integration without the flaws of data warehousing
• However, the current methods using only SERVICE, ASK or VoID descriptions are not suitable over LSLOD
• Combining query federation with schema mapping…
Mapping source schemas to an ontology
Callahan, et al. Journal of Biomedical Semantics 2013
Comparative Toxicogenomics Database
Semanticscience Integrated Ontology
Saccharomyces Genome Database
Cancer Chemoprevention Ontology in OWL Lite
Zeginis, D. et al. Semantic Web Journal, 2013
http://bioportal.bioontology.org/ontologies/CANCO
Query Federation: Using Rule Templates
Hasnain, A., Kamdar MR, et al. ISWC 2014
Canco:Drug Canco:molecular_weight
1) Drugbank:Drug Drugbank:molecular_weight
2) KEGG:Drug KEGG:mol_wt
CANCO
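A rule template of this kind can be read as a lookup from a domain-model term to its source-specific equivalent. The following Python sketch shows that reading; the `RULES` table mirrors the two mappings on the slide, while the function name and data layout are assumptions made for illustration.

```python
from typing import Dict, Tuple

Triple = Tuple[str, str, str]

# Rule templates: one domain-model (CANCO) term -> source-specific terms.
RULES: Dict[str, Dict[str, str]] = {
    "Canco:Drug": {"DrugBank": "Drugbank:Drug", "KEGG": "KEGG:Drug"},
    "Canco:molecular_weight": {
        "DrugBank": "Drugbank:molecular_weight",
        "KEGG": "KEGG:mol_wt",
    },
}


def rewrite(pattern: Triple, source: str) -> Triple:
    """Replace every term that has a rule for this source; keep the rest."""
    s, p, o = pattern
    swap = lambda term: RULES.get(term, {}).get(source, term)
    return (swap(s), swap(p), swap(o))


print(rewrite(("?s", "Canco:molecular_weight", "?mw"), "KEGG"))
```

Because the mapping is term-for-term, it can rename labels but cannot change the shape of a graph pattern.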
Linking methods to generate Rule Templates
• Naïve matching:
  – CANCO:Drug ← Drugbank:Drug
• Named entity matching:
  – CANCO:Molecule ← ChEBI:Compound
• Domain matching:
  – CANCO:Molecule ← CTD:Chemical
  • If {InchiKey(CTD:Chemical uri)} ≈ {InchiKey(ChEBI:Compound uri)}
• RegEx/ID matching:
  – Uniprot:Protein ← Bio2RDF:Uniprot_Resource
  • If {Regex(Uniprot:Protein uri)} = {Regex(Bio2RDF:Uniprot_Resource uri)}
  • http://bio2rdf.org/uniprot:P45059, http://purl.uniprot.org/uniprot/P45059
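The RegEx/ID matching step can be illustrated with the two UniProt URIs from the slide. This is a simplified sketch: the regular expression and the helper names are assumptions, and a production matcher would need a stricter accession pattern.

```python
import re
from typing import Optional

# Capture the identifier that ends the URI, after "uniprot:" (Bio2RDF style)
# or "uniprot/" (purl.uniprot.org style). Simplified for illustration.
ACCESSION = re.compile(r"uniprot[:/]([A-Z0-9]+)$")


def accession(uri: str) -> Optional[str]:
    """Extract the trailing UniProt identifier, or None if there is none."""
    match = ACCESSION.search(uri)
    return match.group(1) if match else None


def same_entity(uri_a: str, uri_b: str) -> bool:
    """Two URIs denote the same protein if their identifiers agree."""
    id_a, id_b = accession(uri_a), accession(uri_b)
    return id_a is not None and id_a == id_b


print(same_entity("http://bio2rdf.org/uniprot:P45059",
                  "http://purl.uniprot.org/uniprot/P45059"))
```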
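Partially solves the Label Mismatch problem but…
Model Mismatch: Different graph patterns to capture granularity
[Figure: DrugBank links Gleevec directly to PDGFR with a drug-target edge; KEGG links Gleevec through an intermediate Inhibits node with target, name, type and source (PubMed:21152856) edges.]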
Mapping Rules:
Drug hasTarget ?Protein ← DrugBank:Drug DrugBank:drug-target ?Protein
Drug hasTarget ?Protein ← KEGG:Drug KEGG:target ?Protein
Query Results: PDGFR
Using Graph Patterns for Query Rewriting
Mapping Rules:
?Drug hasTarget ?Protein ← ?Drug DrugBank:drug-target ?Protein
?Drug hasTarget ?Protein ← ?Drug KEGG:target ?blank KEGG:link ?Protein
List drugs that have Mol. Wt. < 1000 and inhibit proteins involved in signal transduction. Mention their half-life.
Query Rewriting
Domain-model query:
  ?s a <Drug>
  ?s <hasMolWt> ?mw
  ?s <hasTarget> ?protein
  ?s <hasHalfLife> ?hl
  ?mw < 1000
  ?protein <hasGO> <GO:0007165>
Rewritten sub-query (DrugBank):
  ?s a <Drug>
  { ?s <molecular_weight> ?mw_blank  ?mw_blank <value> ?mw }
  ?s <drug-target> ?protein
  { ?s <half-life> ?hl_blank  ?hl_blank <value> ?hl }
  ?mw < 1000
Rewritten sub-query (KEGG):
  ?s a <Drug>
  ?s <mol_wt> ?mw
  { ?s <target> ?protein_blank  ?protein_blank <link> ?protein }
  ?protein <hasGO> <GO:0007165>
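Pattern-based rewriting differs from label substitution in that one domain-model triple may expand into a chain of source triples joined by intermediate nodes. A minimal Python sketch of that expansion is given here; the `PATH_RULES` table encodes the two hasTarget rules from the talk, while the function and the `?blankN` variable naming are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Pattern-based rules: domain predicate -> chain of source predicates.
# A chain longer than one means the source routes through intermediate nodes.
PATH_RULES: Dict[str, Dict[str, List[str]]] = {
    "hasTarget": {"DrugBank": ["drug-target"], "KEGG": ["target", "link"]},
}


def expand(pattern: Triple, source: str) -> List[Triple]:
    """Expand one domain-model triple into the source's chain of triples."""
    s, p, o = pattern
    chain = PATH_RULES[p][source]
    # Fresh variables for intermediate nodes; names restart on every call,
    # so a full rewriter would need globally unique names.
    middle = [f"?blank{i}" for i in range(1, len(chain))]
    nodes = [s] + middle + [o]
    return [(nodes[i], chain[i], nodes[i + 1]) for i in range(len(chain))]


print(expand(("?s", "hasTarget", "?protein"), "KEGG"))
```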
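Include more complex patterns…
[Figure: the class Small Molecule maps to CTD:Chemical, with mol_weight < 900, and to CHEBI:Compound, with molecularWeight pointing to a node that carries a value (< 900) and a unit (Da).]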
• Query federation, combined with schema mapping methods, may provide an alternative approach for the LSLOD cloud
• Query multiple sources without being concerned about the underlying heterogeneity.
• Pattern-based mapping rules can aid in generating complex constructs a priori, as well as aid in entity reconciliation.
• Manual construction of such pattern-based mapping rules, using a visual interface and recommendation
• How to automate…
Future Work: How to learn these mapping rules…
[Figure: graph patterns from DrugBank and KEGG (edge labels: Inhibits, type, name, target, drug-target, calculated) are compared with the Domain Model pattern Drug, target, Protein; candidate edges are marked name?, target?, drug-target?.]
Pattern ranking
Number of nodes, number of edges, number of blank nodes, distribution of nodes, start node, Sim(E_pattern, E_model), ...
Logistic Regression
Ongoing Work: Evaluation of the method
• Comparative evaluation:
  – FedX,
  – SPLENDID and
  – SPLENDID augmented with the query rewriting component
• Metrics:
  – Query complexity
  – Source selection time
  – Query execution time, etc.
• Benchmarks:
  – LargeRDFBench (Billion Triples Benchmark consisting of Linked TCGA, DrugBank, KEGG, Affymetrix)
• Involve domain users…
Application: Systems Pharmacology
Systems Pharmacology
Zhao, Shan, and Ravi Iyengar. Annual Review of Pharmacology and Toxicology 52 (2012): 505.
Underlying mechanisms for drug-drug interactions.
Jia et al. Nature Reviews Drug Discovery, 8(2):111–128, 2009
Data Model
Concept
  E1  Drug
  E2  Protein
  E3  Pathway
  E4  Adverse Drug Reaction
Relation
  R1  Drug hasTarget Protein
  R2  Drug hasEnzyme Protein
  R3  Drug hasTransporter Protein
  R4  Protein isPresentIn Pathway
  R5  Pathway isImplicatedIn ADR
Manually generated rules
Source                       Graph Pattern (R1)
Drugbank                     E1 <--drug-- Target-Relation --target--> E2
PharmGKB                     E1 <--drug-- gene-drug-Association --gene--> E2
KEGG                         E1 --target--> _:blank --link--> E2
Comparative Toxicogenomics   E1 <--chemical-- Chemical-Gene-Association --gene--> E2
PhLeGrA – Linked Graph Analytics in Pharmacology
Graph Analytics Module
• Association: {Drug}n --> ADR
• 2-state Hidden Conditional Random Field model over the k-partite network, with (k-2) hidden layers.
  – Discriminative probabilistic graphical model
  – Unobserved entities on the association path
  – No assumption on independence of inputs
  – Quattoni, Ariadna, et al. IEEE Trans. Pattern Anal. Mach. Intell. 2007.
• Inputs–Outcomes database to train the HCRF.
  – US FDA Adverse Event Reporting System (FAERS)
  – Drugs, Adverse Reactions, Indications, Doses etc.
  – Text pre-processing using UMLS terminologies
  – X = Drugs, Y = ADRs, H = {Proteins, Pathways}
Network statistics
Generated in < 1 day
Semantic Web and Systems Pharmacology
R1: Drug hasTarget Protein
E1: Drug
• Similar and completely unique entities and relations exist between data sources
• Necessary to get the complete picture, but also determine sources of noise
• “Will the correct drugs please stand up?” - Southan et al. GCC 2016
AUROC curves for some ADRs
Web application to explore underlying mechanisms
http://onto-apps.stanford.edu/phlegra
Future Work…
• How to test the significance of the associations discovered?
• Implement other models (e.g. simple path enrichment)
  – Deal with reporting bias.
• Use existing “Silver standards” such as OMOP, EU-ADR, Drugbank DDIs, MedSpan, etc.
• Relevance of the associations, as well as the correctness of the underlying mechanisms.
• Importance of the sources to create such networks.
Application: Biomedical Question-Answering
ReVeaLD: Real-time Visual Explorer and Aggregator of Linked Data
Kamdar, Maulik R., et al. Journal of Biomedical Informatics 2014.
To summarize…
• Exciting opportunities to seamlessly query and integrate data and knowledge from isolated sources.
• Query federation can aid towards biomedical question answering and systems pharmacology
• The data collected, or the networks generated, can be used in downstream analytics approaches (e.g. Protein/Drug structures in Autodock Vina)
• Make Semantic Web more usable for biomedical researchers!
Acknowledgments
Musen Lab
- Mark Musen
- Tania Tudorache
- Csongor Nyulas
- Matthew Horridge
- Rafael Gonçalves
- Josef Hardi
- Marcos Martinez
- Martin O’Connor
- John Graybeal
- Alex Screnchuk
And others…
Michel Dumontier, Russ Altman, Rainer Winnenberg, Juan Banda, Amrapali Zaveri, BMI Students
Saleem M., Ali Hasnain, Axel Ngonga, Helena Deus, Jonas Almeida, Stefan Decker
“Discovery informatics is in its infancy. Search engines are grappling with the need for deep search, but it is doubtful they will fulfill the needs of the biomedical research community when it comes to finding and analyzing the appropriate datasets…”
- Philip Bourne, Associate Director for Data Science, NIH, 2014
Ebola-KB Dashboard
Kamdar, MR., et al. Database 2015.
Linked TCGA
Saleem, M., Kamdar MR, et al. Web Semantics, 2014.
Pathway-based drug repositioning
Li, et al. BMC Bioinformatics 2013