Importing life science data into Neo4j

Importing linked life science databases into Neo4j Simon Jupp Sample Phenotypes and Ontologies Team European Bioinformatics Institute [email protected]

Upload: simon-jupp

Post on 16-Mar-2018

Page 1: Importing life science data into Neo4j

Importing linked life science databases into Neo4j

Simon Jupp

Sample Phenotypes and Ontologies Team

European Bioinformatics Institute

[email protected]

Purpose of the workshop

• Introduce two alternative graph models
  • RDF graphs
  • Property graphs

• Demonstrate a simple data integration use-case

• Show how Neo4j data import tools can be used to rapidly import life science data from public APIs

• Example Cypher for querying biological data

• Introduction to Neo4j sandboxes, APOC procedures and tips for creating your own Neo4j guide

Some biological questions

“Differentially expressed genes in adult mice, bred in oxygen rich vs oxygen poor environments? Of this set, which biological processes (GO) are enriched?”

“Where are genes with antigen binding function differentially expressed, which disease and which associated pathways?”

“Get metformin associated pathways with differentially expressed genes, find any proteins that are targets for known diabetes drugs”

How do you go about answering these kinds of questions?...

… you go to the data

Literature & ontologies
• Experimental Factor Ontology
• Gene Ontology
• BioStudies
• Europe PMC

Chemical biology
• ChEBI
• ChEMBL
• SureChEMBL

Molecular structures
• Protein Data Bank in Europe
• Electron Microscopy Data Bank

Gene, protein & metabolite expression
• Expression Atlas
• MetaboLights
• PRIDE
• RNAcentral

Protein sequences, families & motifs
• InterPro
• Pfam
• UniProt

Genes, genomes & variation
• Ensembl
• Ensembl Genomes
• GWAS Catalog
• Metagenomics portal

Systems
• BioModels
• BioSamples
• Enzyme Portal
• IntAct
• Reactome

Molecular Archives
• European Nucleotide Archive
• European Variation Archive
• European Genome-phenome Archive
• ArrayExpress

Data integration challenges

• Heterogeneous formats and identifiers
• We invest heavily in mapping and cross-linking resources, but it’s still hard to integrate and query across internal/external resources
  • Mapping takes a lot of effort, and each group duplicates that effort

Standardise data publishing

• What if we could standardise the way we publish data?
  • Global identification systems (so we can identify the things in our data)
  • Common semantics (talking about the same things)
  • A common query language for the data

Original vision of the Web

Information Management: A Proposal, Tim Berners-Lee, CERN, March 1989, May 1990, http://www.w3.org/History/1989/proposal.html

[Figure: diagram from the proposal, annotated with “Things”, Relations, Vocabularies and Early Web]

Semantic Web (or Linked Data)

"The Semantic Web is a webby way to link data"

“Turning the web into a global API”

“The existing web links documents, the semantic web links data”

“Shared meaning through ontologies”

The Linking Open Data cloud 2017

http://lod-cloud.net

RDF is for describing graphs

• 1995-2004: the W3C develops a specification for a Web metadata vocabulary called the Resource Description Framework (RDF)

[Figure: the web document http://en.wikipedia.org/wiki/Barack_Obama republished as structured data, i.e. as a graph]

dbpedia:Barack_Obama type Human
dbpedia:Barack_Obama position_held President of the United States
dbpedia:Barack_Obama birthplace Honolulu
dbpedia:Barack_Obama birthdate 1961-08-04

Anatomy of a triple statement

• All triples are composed of a subject, predicate and an object

Subject: Barack Obama
Predicate: birth place
Object: Honolulu
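A minimal Python sketch of the same idea, assuming plain strings stand in for the URIs a real RDF graph would use. The `Triple` type and the `objects_of` helper are illustrative names, not part of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject --predicate--> object."""
    subject: str
    predicate: str
    object: str


# Statements from the Barack Obama example, written with short names
# instead of full URIs to keep the sketch readable.
triples = [
    Triple("Barack Obama", "birth place", "Honolulu"),
    Triple("Barack Obama", "birth date", "1961-08-04"),
    Triple("Barack Obama", "type", "Human"),
]


def objects_of(subject: str, predicate: str, statements: list[Triple]) -> list[str]:
    """Return every object that matches the given subject and predicate."""
    return [t.object for t in statements
            if t.subject == subject and t.predicate == predicate]


print(objects_of("Barack Obama", "birth place", triples))  # ['Honolulu']
```

Each new fact is just one more `Triple` in the list, which is why a set of triples can be read as a graph with labelled edges.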

Identify things on the web

• Build on existing Web technology
• Global identifiers for resources (things) using URIs
• URIs should resolve

Subject: http://dbpedia.org/page/Barack_Obama
Predicate: http://dbpedia.org/property/birthPlace
Object: http://dbpedia.org/page/Honolulu

Turning relational data into RDF: EBI Gene Expression Atlas database

Relational data to RDF graph conversion
• Give “things” URIs
• Type “things” with ontologies
• Link “things” to other related “things”

Storing and querying RDF

• Optimized databases for RDF data, e.g. Stardog, Apache Jena, Sesame, Virtuoso, AllegroGraph, OWLIM
• SPARQL query language

Querying RDF with SPARQL

• W3C standard query language for querying RDF data
• Query language for matching graph patterns in RDF
• SPARQL endpoints: a common API to query RDF data
• Example: “Get all presidents of the United States” from https://query.wikidata.org/

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX position_held: <http://www.wikidata.org/prop/direct/P39>
PREFIX potus: <http://www.wikidata.org/entity/Q11696>

SELECT ?label WHERE {
  # anything that has held the position "President of the United States"
  ?subject position_held: potus: .
  ?subject rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
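Because a SPARQL endpoint is an HTTP API, the query above can be sent from any language. The sketch below is a hedged illustration in Python: it assumes the Wikidata endpoint lives at `https://query.wikidata.org/sparql` and accepts a `format=json` parameter, and it parses a hand-written sample in the standard SPARQL JSON results layout rather than a live response. The helper names are made up for this example.

```python
import json
from urllib.parse import urlencode

# Assumption: the SPARQL endpoint behind https://query.wikidata.org/
ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX position_held: <http://www.wikidata.org/prop/direct/P39>
PREFIX potus: <http://www.wikidata.org/entity/Q11696>
SELECT ?label WHERE {
  ?subject position_held: potus: .
  ?subject rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as an HTTP GET URL that asks for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


def extract_values(results_json: str, variable: str) -> list[str]:
    """Pull one variable's values out of a SPARQL JSON results document."""
    bindings = json.loads(results_json)["results"]["bindings"]
    return [row[variable]["value"] for row in bindings if variable in row]


# Hand-written sample in the SPARQL JSON results layout, not a live response.
SAMPLE = json.dumps({
    "head": {"vars": ["label"]},
    "results": {"bindings": [
        {"label": {"type": "literal", "xml:lang": "en", "value": "Barack Obama"}},
    ]},
})

print(build_request_url(ENDPOINT, QUERY)[:50])
print(extract_values(SAMPLE, "label"))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the response body to `extract_values` would complete the round trip; that step is left out so the sketch runs without network access.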

RDF and the Property graph

RDF graphs

dbpedia:Barack_Obama type Human
dbpedia:Barack_Obama position_held President of the United States
dbpedia:Barack_Obama birthplace Honolulu
dbpedia:Barack_Obama birthdate “1961-08-04”^^xsd:date
dbpedia:Barack_Obama name “Barack_Obama”^^xsd:string

• Every statement adds a new edge to the graph
• All nodes are resources (with URIs) or literals (with types)

Property graphs (Neo4j)

Node: dbpedia:Barack_Obama {name: “Barack Obama”, type: “Human”}
Edge: birthplace {birthdate: “1961-08-04”}, pointing to Honolulu

• Nodes and edges have internal structure
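The difference between the two models can be sketched in a few lines of Python. This is an illustration under assumptions, not the conversion used in the workshop: each statement carries a hypothetical `as_property` flag saying whether it should be folded into the subject node's properties or kept as an edge, and all values are folded into node properties rather than edge properties.

```python
# (subject, predicate, object, as_property): statements from the example above.
# The as_property flag is a hypothetical marker added for this sketch.
TRIPLES = [
    ("dbpedia:Barack_Obama", "name", "Barack Obama", True),
    ("dbpedia:Barack_Obama", "type", "Human", True),
    ("dbpedia:Barack_Obama", "birthdate", "1961-08-04", True),
    ("dbpedia:Barack_Obama", "birthplace", "dbpedia:Honolulu", False),
]


def to_property_graph(triples):
    """Fold flagged statements into node properties; keep the rest as edges."""
    nodes = {}   # node id -> {property name: value}
    edges = []   # (source id, relationship type, target id)
    for subject, predicate, obj, as_property in triples:
        properties = nodes.setdefault(subject, {})
        if as_property:
            properties[predicate] = obj
        else:
            nodes.setdefault(obj, {})  # the target becomes a node of its own
            edges.append((subject, predicate, obj))
    return nodes, edges


nodes, edges = to_property_graph(TRIPLES)
print(nodes)
print(edges)
```

Four RDF statements collapse into two nodes and one edge, which is the kind of simplification that makes a property graph easier to read for a specific use case.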

Working with RDF and Neo4j

• RDF is great for publishing data
• SPARQL gives flexible access to query data
• But…
  • RDF schemas are often (necessarily) complex
  • They expose the full underlying data semantics
  • RDF comes with baggage that can be a turn-off for newcomers
• Neo4j is for graphs
  • Easier to grasp for beginners
  • Powerful query language (Cypher)
  • Excellent third-party tools, community and developer integrations

Working with RDF and Neo4j

• In this tutorial we will harness the publishing power of RDF
• Combine it with the simplicity and querying power of Neo4j
• Use Neo4j data import tools to rapidly import data from public SPARQL endpoints
• Simplify the graph schema to fit a specific use case
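The workshop guide performs the import inside Neo4j. The Python sketch below only illustrates the shape of that transformation: SPARQL JSON bindings are flattened into plain rows, which a parameterised Cypher statement could then merge into a simplified graph. The labels `Gene` and `Disease`, the relationship type `ASSOCIATED_WITH`, the variable names and the example.org URIs are all hypothetical placeholders.

```python
# Hypothetical labels, relationship type and variable names; adjust to your schema.
IMPORT_CYPHER = """
UNWIND $rows AS row
MERGE (g:Gene {uri: row.gene})
MERGE (d:Disease {uri: row.disease})
MERGE (g)-[:ASSOCIATED_WITH]->(d)
"""


def bindings_to_rows(sparql_results: dict) -> list[dict]:
    """Flatten SPARQL JSON bindings into plain {variable: value} rows."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in sparql_results["results"]["bindings"]
    ]


# Placeholder result set with made-up example.org URIs, for illustration only.
sample = {
    "head": {"vars": ["gene", "disease"]},
    "results": {"bindings": [
        {"gene": {"type": "uri", "value": "http://example.org/gene/1"},
         "disease": {"type": "uri", "value": "http://example.org/disease/1"}},
    ]},
}

rows = bindings_to_rows(sample)
print(rows)
```

The resulting `rows` list would be passed as the `$rows` parameter when running `IMPORT_CYPHER`. Using `MERGE` rather than `CREATE` keeps the import idempotent, so re-running it does not duplicate nodes or relationships.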

Use-case

• Build a simple graph of gene-disease and drug-disease associations
• Data from public resources (Ensembl, GWAS, ChEMBL)

Setup for workshop

• Get a sandbox Neo4j instance from https://neo4j.com/sandbox-v2/
• Optionally run your own local installation, but you’ll need the APOC procedures installed to run the guide

• Run the Neo4j guide

:play https://guides.neo4j.com/life-science-import