Semantic Variation Graphs: the case for RDF & SPARQL

Jerven Bolleman, Swiss-Prot Group

Upload: jerven-bolleman

Post on 13-Apr-2017

Mini Introduction to RDF

Resource Description Framework

Subject → Predicate → Object
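The subject, predicate, object model can be sketched in a few lines of Python. This is only an illustration of the data model, not how any RDF store is implemented; the `ex:` identifiers and the `match` helper are hypothetical names introduced for this example.

```python
# Each statement is one (subject, predicate, object) triple.
# The identifiers below are illustrative and do not come from the slides.
TRIPLES = [
    ("ex:node1", "rdf:type", "ex:Node"),
    ("ex:node1", "rdf:value", "ACTG"),
    ("ex:node2", "rdf:type", "ex:Node"),
]


def match(triples, s=None, p=None, o=None):
    """Yield the triples that fit a pattern; None acts as a wildcard."""
    for triple in triples:
        if s is not None and triple[0] != s:
            continue
        if p is not None and triple[1] != p:
            continue
        if o is not None and triple[2] != o:
            continue
        yield triple


# "Which subjects are of type ex:Node?" is a pattern with one free position.
nodes = [s for s, _, _ in match(TRIPLES, p="rdf:type", o="ex:Node")]
```

A SPARQL query is, roughly, a set of such patterns whose free positions are named variables.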

Lots of SPARQL databases, e.g. Virtuoso Universal Server

RDF can be serialised as:

• Turtle
• RDFa inside HTML
• N-Triples
• RDF/THRIFT
• JSON-LD
• RDF/XML

All of these serialisations carry the same triples (Turtle = RDFa = N-Triples = RDF/THRIFT = JSON-LD = RDF/XML).
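To make the point that a serialisation is only a way of writing the same triples down, here is a minimal sketch that writes triples as N-Triples and reads them back unchanged. It handles only a small subset of N-Triples (IRIs and plain single-line string literals), and the `http://example.org/vg/...` IRIs are placeholders chosen for this example.

```python
import re

RDF_VALUE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"

# A term is ("iri", text) or ("lit", text); a simplification for this sketch.
TRIPLES = [
    (("iri", "http://example.org/vg/node/1"), ("iri", RDF_VALUE), ("lit", "ACTG")),
    (("iri", "http://example.org/vg/node/1"),
     ("iri", "http://example.org/vg/linksForwardToForward"),
     ("iri", "http://example.org/vg/node/2")),
]


def term_to_nt(term):
    kind, value = term
    if kind == "iri":
        return f"<{value}>"
    # Escape backslashes first, then double quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """One triple per line, terminated by ' .'"""
    return "".join(
        f"{term_to_nt(s)} {term_to_nt(p)} {term_to_nt(o)} .\n" for s, p, o in triples
    )


LINE = re.compile(r'^<([^>]*)> <([^>]*)> (?:<([^>]*)>|"((?:[^"\\]|\\.)*)") \.$')


def from_ntriples(text):
    """Parse the subset of N-Triples produced by to_ntriples."""
    triples = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m is None:
            raise ValueError(f"unsupported line: {line!r}")
        s, p, o_iri, o_lit = m.groups()
        if o_iri is not None:
            obj = ("iri", o_iri)
        else:
            obj = ("lit", re.sub(r"\\(.)", lambda x: x.group(1), o_lit))
        triples.append((("iri", s), ("iri", p), obj))
    return triples


serialised = to_ntriples(TRIPLES)
round_tripped = from_ntriples(serialised)
```

After the round trip the list of triples is identical to the input, which is the sense in which the formats are interchangeable.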

Nodes and Edges are Resources

• Resource → Identified by a URI
– http://purl.uniprot.org/core/
– urn:guid:21EC2020-3AEA-4069-A2DD-08002B30309D
– mailto:[email protected]
– urn:isbn:978-3-16-148410-0

• Nice if public but not a requirement

Terminal edges are literals

• String (xsd:string)
– "P53"

• Date (xsd:date & xsd:dateTime)
– "1987-08-13"^^xsd:date

• Numbers (xsd:int & xsd:decimal & …)
– 1 or "1"^^xsd:integer or -1.1 or "-1.1"^^xsd:decimal

• Language string
– "Switzerland"@en
– "Suisse"@fr
– "Schweiz"@de
– "Svizzera"@it
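As a rough illustration of the literal forms listed above, the following sketch maps Python values to their Turtle spelling. The function name `to_literal` is made up for this example, and it assumes the string values contain no quotes or backslashes that would need escaping.

```python
from datetime import date, datetime
from decimal import Decimal


def to_literal(value, lang=None):
    """Render a Python value as a Turtle literal (no escaping; sketch only)."""
    if lang is not None:
        return f'"{value}"@{lang}'            # language-tagged string
    if isinstance(value, bool):               # bool must be tested before int
        return f'"{str(value).lower()}"^^xsd:boolean'
    if isinstance(value, int):
        return f'"{value}"^^xsd:integer'
    if isinstance(value, Decimal):
        return f'"{value}"^^xsd:decimal'
    if isinstance(value, datetime):           # datetime must be tested before date
        return f'"{value.isoformat()}"^^xsd:dateTime'
    if isinstance(value, date):
        return f'"{value.isoformat()}"^^xsd:date'
    return f'"{value}"'                       # plain xsd:string


examples = [
    to_literal("P53"),
    to_literal(date(1987, 8, 13)),
    to_literal(1),
    to_literal(Decimal("-1.1")),
    to_literal("Suisse", lang="fr"),
]
```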

Others use it too, and are cross-queryable

one party evolves data format

everyone evolves data format

Protocol Buffers (Google's data interchange format)

GFF

Variation Graph as RDF

4 nodes

[Figure: variation graph with nodes 1 (ACTG), 2 (AC), 3 (T) and 4 (GA)]

base <uri of vg schema>
prefix node: <uri of vg graph>

node:1 a <Node> ; rdf:value "ACTG" .
node:2 a <Node> ; rdf:value "AC" .
node:3 a <Node> ; rdf:value "T" .
node:4 a <Node> ; rdf:value "GA" .

base <uri of vg schema>
prefix node: <uri of vg graph>

node:1 <linksForwardToForward> node:2 , node:3 .
node:2 <linksForwardToForward> node:4 .
node:3 <linksForwardToForward> node:4 .
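The node and edge statements above describe a small directed graph. As a sketch of what that graph encodes, the Python below enumerates every walk from node 1 to the end of the graph and spells out the sequence each walk represents. The dictionaries mirror the slide's four nodes and their linksForwardToForward edges; the helper `walks` is a name invented for this example and assumes the graph has no cycles.

```python
# Node sequences and forward-to-forward links, copied from the slide's example.
NODES = {1: "ACTG", 2: "AC", 3: "T", 4: "GA"}
EDGES = {1: [2, 3], 2: [4], 3: [4]}


def walks(start, edges):
    """Yield every walk from start to a node without outgoing edges.

    Assumes the graph is acyclic, as the four-node example is.
    """
    successors = edges.get(start, [])
    if not successors:
        yield [start]
        return
    for nxt in successors:
        for rest in walks(nxt, edges):
            yield [start] + rest


all_walks = list(walks(1, EDGES))
sequences = ["".join(NODES[n] for n in walk) for walk in all_walks]
# all_walks == [[1, 2, 4], [1, 3, 4]]
# sequences == ["ACTGACGA", "ACTGTGA"]
```

The two walks differ only in the middle node, which is how the graph represents a variant (AC versus T) between two shared flanking sequences.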

4 nodes → 1 Path

base <uri of vg schema>
prefix n: <uri of vg graph>

path:1 a <Path> ;
  rdfs:label "Genome of patient a" ;
  rdfs:comment "Paths through VG make linear sequences, e.g. a reference genome or a patient assembly" .

4 nodes → 1 Path → 3 Steps

base <uri of vg schema>
prefix n: <uri of vg graph>

step:1 a <Step> ; <node> node:1 ; <rank> 1 ; <path> path:1 .
step:2 a <Step> ; <node> node:2 ; <rank> 2 ; <path> path:1 .
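A path is stored as one Step per visited node, with the rank giving the position along the path. The sketch below generates such step statements from an ordered list of node ids. The slide shows only the first two steps; the third step (node 4, rank 3) is an assumption made here because the path must continue from node 2 to node 4. The function name `steps_as_turtle` is hypothetical.

```python
def steps_as_turtle(path_id, node_ids):
    """Emit one Turtle statement per step; rank is the 1-based position."""
    lines = []
    for rank, node_id in enumerate(node_ids, start=1):
        lines.append(
            f"step:{rank} a <Step> ; <node> node:{node_id} ; "
            f"<rank> {rank} ; <path> path:{path_id} ."
        )
    return lines


# Assumed walk for path 1: nodes 1, 2 and 4 (the third step is not on the slide).
turtle_lines = steps_as_turtle(1, [1, 2, 4])
```

Numbering steps by rank alone, as done here, only works for a single path; a graph with several paths would need step identifiers that are unique across paths.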

Variation Graph explored using SPARQL

Build a “FASTA” from a VG

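PREFIX vg: <http://example.org/vg/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?path (GROUP_CONCAT(?sequence; separator="") AS ?pathSeq)
WHERE {
  # Each step links a path, a node and the node's rank along that path.
  [] vg:path ?path ;
     vg:node ?node ;
     vg:rank ?rank .
  ?node rdf:value ?sequence
}
GROUP BY ?path
# ORDER BY ?rank is meant to concatenate node sequences in path order.
# SPARQL 1.1 does not guarantee the order inside GROUP_CONCAT, so this depends on the store.
ORDER BY ?rank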
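What the query is meant to compute can be shown without a SPARQL engine: group the steps by path, order each group by rank, and concatenate the node sequences. The Python below is a plain re-implementation of that intent, not a description of how any triple store evaluates the query. The second path (`path:2`, via node 3) and the third step of `path:1` are hypothetical data added so that the grouping is visible.

```python
from collections import defaultdict

NODE_SEQ = {"node:1": "ACTG", "node:2": "AC", "node:3": "T", "node:4": "GA"}

# (path, node, rank) rows, deliberately out of order, as a store might return them.
STEPS = [
    ("path:1", "node:4", 3),
    ("path:1", "node:1", 1),
    ("path:2", "node:3", 2),
    ("path:1", "node:2", 2),
    ("path:2", "node:1", 1),
    ("path:2", "node:4", 3),
]


def path_sequences(steps, node_seq):
    """GROUP BY path, order by rank, then concatenate the node sequences."""
    by_path = defaultdict(list)
    for path, node, rank in steps:
        by_path[path].append((rank, node_seq[node]))
    return {
        path: "".join(seq for _, seq in sorted(items, key=lambda item: item[0]))
        for path, items in by_path.items()
    }


def to_fasta(seqs):
    """Write one FASTA record per path, sorted by path name."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sorted(seqs.items()))


fasta = to_fasta(path_sequences(STEPS, NODE_SEQ))
```

Sorting by rank inside each group is the step that SPARQL's GROUP_CONCAT does not promise, which is why the explicit sort is spelled out here.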

SPARQL: a standard query language

See the VG wiki for more examples

VG 1000 Genomes → 50 GB on disk in DB

VG 100,000 Genomes → ±2 TB on disk in DB
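The two disk footprints quoted above imply that storage grows more slowly than the number of genomes. A quick back-of-the-envelope calculation, assuming decimal units (1 TB = 1000 GB) and treating the "±2 TB" figure as exactly 2 TB:

```python
GB_PER_TB = 1000  # assumption: decimal units

# genomes -> approximate size on disk in GB, as quoted on the slide
on_disk_gb = {1_000: 50, 100_000: 2 * GB_PER_TB}

# Average footprint per genome, in MB.
per_genome_mb = {n: gb * 1000 / n for n, gb in on_disk_gb.items()}

genome_factor = 100_000 / 1_000                    # 100x more genomes
disk_factor = on_disk_gb[100_000] / on_disk_gb[1_000]  # 40x more disk
```

Under these assumptions the average cost drops from about 50 MB to about 20 MB per genome, which is consistent with genomes sharing most of their sequence in the graph.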

Querying a Variation Graph

Summary

• RDF
– simple data model
– consistent identifiers
– anyone can say anything about anything

• SPARQL
– graph query language
– wide-scale commercial deployment
– HTTP|REST in the box
– in clinical use
– federated queries on user demand
– can be used for variation graphs

Questions?