High-performance web services for gene and variant annotations

Chunlei Wu, Ph.D. (@chunleiwu), Associate Professor of Molecular Medicine, Dept. of Molecular Experimental Medicine, The Scripps Research Institute, La Jolla, CA, USA. 07/2016. High-performance web services for gene and variant annotations: MyVariant.info, MyGene.info

Upload: chunlei-wu

Post on 12-Feb-2017


TRANSCRIPT

Page 1: High-performance web services for gene and variant annotations

Chunlei Wu, Ph.D.

@chunleiwu

Associate Professor of Molecular Medicine
Dept. of Molecular Experimental Medicine

The Scripps Research Institute
La Jolla, CA, USA

07/2016

High-performance web services for gene and variant annotations

MyVariant.info

MyGene.info

Page 2: High-performance web services for gene and variant annotations

Biological knowledge is a complex network

No one-size-fits-all database can capture the entire knowledge space

Page 3: High-performance web services for gene and variant annotations

Typical database representations

Example record: { _id: 1017, name: CDK2, taxid: 9606}

Relational database: tables

Document database: JSON objects

RDF triplestore: triples

Key-value store: key-value pairs

Page 4: High-performance web services for gene and variant annotations

BioThings APIs are built on document databases

Why we picked document databases:

• Object representation

• Rich data structures, handles heterogeneous data very well

• Atomic operations, built for big-data scale

Page 5: High-performance web services for gene and variant annotations

Gene and Variant annotations represented in JSON documents

{ "_id": "chr1:g.196659237C>T", "cosmic": { "chrom": "1", "hg19": { "start": 196659237, "end": 196659237 }, "ref": "C", "alt": "T", "tumor_site": "breast", "mut_freq": 0.49, "mut_nt": "C>T", "cosmic_id": "COSM424915" } }

{ "_id": "1017", "Symbol": "CDK2", "Ensembl": "ENSG00000123374", "RefSeq": [ "NM_001798", "NM_052827" ], "Reporter": { "U95A": [ "1792_g_at", "1833_at" ], "U133A": [ "211804_s_at", "2045252_at", "211803_at" ] } }
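
As a rough illustration of why nested JSON documents are convenient for heterogeneous annotations, the two documents on this slide can be loaded and navigated with nothing but the Python standard library. The `get_path` helper is a hypothetical convenience written for this sketch, not part of any BioThings client.

```python
import json

# Gene document from the slide, written with straight quotes so it parses as JSON.
gene_json = """
{"_id": "1017", "Symbol": "CDK2", "Ensembl": "ENSG00000123374",
 "RefSeq": ["NM_001798", "NM_052827"],
 "Reporter": {"U95A": ["1792_g_at", "1833_at"],
              "U133A": ["211804_s_at", "2045252_at", "211803_at"]}}
"""

# Variant document from the slide.
variant_json = """
{"_id": "chr1:g.196659237C>T",
 "cosmic": {"chrom": "1", "hg19": {"start": 196659237, "end": 196659237},
            "ref": "C", "alt": "T", "tumor_site": "breast",
            "mut_freq": 0.49, "mut_nt": "C>T", "cosmic_id": "COSM424915"}}
"""


def get_path(doc, dotted_path, default=None):
    """Hypothetical helper: follow a dotted path such as 'cosmic.hg19.start'."""
    node = doc
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


gene = json.loads(gene_json)
variant = json.loads(variant_json)

# Each document carries only the fields its sources provide; absent fields
# simply fall back to the default instead of requiring a schema change.
print(get_path(gene, "Reporter.U133A"))
print(get_path(variant, "cosmic.hg19.start"))
print(get_path(variant, "clinvar.rcv", default="no such field"))
```

The dotted-path style mirrors how nested fields are named elsewhere in this deck (for example `cadd.consequence` and `exac.af` in the demo slide).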

Page 6: High-performance web services for gene and variant annotations

Keep data always up-to-date

Each data source is updated individually. Colors indicate their different updating schedules.

Schematic view of MyVariant.info architecture

Page 7: High-performance web services for gene and variant annotations

High-performance web service APIs

Schematic view of MyVariant.info architecture

Page 8: High-performance web services for gene and variant annotations

MyGene.info + MyVariant.info

Gene (G): MyGene.info

/v2/gene/<geneid>
/v2/query?q=<query>

/v3/gene/<geneid>
/v3/query?q=<query>

Variant (V): MyVariant.info

/v1/variant/<hgvsid>
/v1/query?q=<query>

Single query on GET, batch query on POST
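
A minimal sketch of how the two endpoint patterns above translate into requests, using only the Python standard library. Nothing is sent over the network here; the code only builds the URLs and the POST request object. The host names and path patterns come from the slides, while the comma-separated `q` value and the `scopes` parameter in the batch request are assumptions about the POST body format and should be checked against the service documentation.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

MYGENE = "http://mygene.info/v3"
MYVARIANT = "http://myvariant.info/v1"


def annotation_url(base, entity, entity_id):
    """GET URL for one object, e.g. /v3/gene/<geneid> or /v1/variant/<hgvsid>."""
    # HGVS ids contain characters such as '>' that must be percent-encoded.
    return f"{base}/{entity}/{quote(entity_id, safe=':')}"


def query_url(base, q, **params):
    """GET URL for the query endpoint, /query?q=<query>."""
    return f"{base}/query?{urlencode({'q': q, **params})}"


def batch_query_request(base, terms, scopes):
    """POST request for a batch query (body format is an assumption)."""
    body = urlencode({"q": ",".join(terms), "scopes": scopes}).encode("ascii")
    return Request(f"{base}/query", data=body, method="POST")


print(annotation_url(MYGENE, "gene", "1017"))
print(annotation_url(MYVARIANT, "variant", "chr1:g.196659237C>T"))
print(query_url(MYGENE, "CDK2"))

req = batch_query_request(MYGENE, ["1017", "1018"], scopes="entrezgene")
print(req.get_method(), req.full_url, req.data)
```

Passing the `Request` object to `urllib.request.urlopen` would actually send it; that step is left out so the sketch runs offline.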

Page 9: High-performance web services for gene and variant annotations

We focus on building APIs. Try to …

Page 10: High-performance web services for gene and variant annotations

Make it really easy to use

Just two endpoints

No registration/sign-in

No API key

Page 11: High-performance web services for gene and variant annotations

Developer-friendly

Python/R clients (also a JS client for MyVariant.info)

Search “mygene” and “myvariant” in PyPI and Bioconductor

Supported: JSONP, CORS, HTTPS, msgpack, HTTP compression, HTTP caching, JSON-LD

Page 12: High-performance web services for gene and variant annotations

Aggregate Everything about gene and variant

MyGene.info: supports >15M genes for ~17K species, ~200 annotation fields

MyVariant.info: supports >334M variants, ~500 annotation fields, from 14 sources: ClinVar, dbNSFP, dbSNP, …

Page 13: High-performance web services for gene and variant annotations

Keep up-to-date

MyGene.info: updated weekly. Supports >15M genes for ~17K species, ~200 annotation fields

MyVariant.info: updated ~monthly. Supports >334M variants, ~500 annotation fields, from 14 sources: ClinVar, dbNSFP, dbSNP, …

Page 14: High-performance web services for gene and variant annotations

High-performance and scalable

>95% of queries respond in < 30 ms

Page 15: High-performance web services for gene and variant annotations

High-performance and scalable

“Stress test” suggests support for >5,000 concurrent users for ~10,000 requests per minute

Page 16: High-performance web services for gene and variant annotations

High-performance and scalable

Page 17: High-performance web services for gene and variant annotations

High availability

99.999% over last year

MyVariant.info

MyGene.info

99.87% over last 6 months

Availability tracked by
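
To put the two availability figures on this slide in perspective, they can be converted into the total downtime they allow. This is plain arithmetic, not data from the monitoring service; treating "6 months" as 182.5 days is an assumption of this sketch.

```python
def allowed_downtime_minutes(availability_percent, period_days):
    """Minutes of downtime implied by an availability figure over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# 99.999% over one year (365 days): a little over five minutes in total.
year_minutes = allowed_downtime_minutes(99.999, 365)

# 99.87% over six months (assumed 182.5 days): between five and six hours.
half_year_hours = allowed_downtime_minutes(99.87, 182.5) / 60

print(f"{year_minutes:.2f} minutes per year")
print(f"{half_year_hours:.2f} hours per six months")
```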

Page 18: High-performance web services for gene and variant annotations

Who is using

Live applications:

MinePath.org

Gene Wiki

JBrowse

Page 19: High-performance web services for gene and variant annotations

Who is using

Many users call them in their daily analysis pipelines, or simply cache annotations locally

Page 20: High-performance web services for gene and variant annotations

MyGene.info recent usage stats

Month     Requests      Unique IPs
Jan-16     3,885,192    2,498
Feb-16     5,313,950    2,786
Mar-16     3,362,354    3,121
Apr-16    10,918,104    3,065
May-16    10,776,858    3,803
Jun-16     6,396,148    3,940

Request sources: 39% direct calls, 38% mygene.py, 14% mygene.R, 9% BioGPS

Over 40M requests in six months

Page 21: High-performance web services for gene and variant annotations

MyVariant.info recent usage stats

Month     Requests      Unique IPs
Jan-16        83,519    1,330
Feb-16     3,054,191    1,192
Mar-16       272,424    1,771
Apr-16       701,526    1,500
May-16        89,642    1,891
Jun-16       213,767    1,924

Request sources: 21% direct calls, 23% myvariant.py, 50% myvariant.R, 6% myvariant.js

~4.5M requests in six months

Page 22: High-performance web services for gene and variant annotations

Generalized BioThings SDK

MyGene.info and MyVariant.info generalized into the BioThings SDK:

• JSON data aggregation mechanism

• High-performance query engine

• Well-designed REST API pattern

• JSON-LD enabled Linked Data

• Data-updating scheduler

• Python/R clients

• …

Page 23: High-performance web services for gene and variant annotations

BioThings SDK

A tutorial is here (more docs are coming): http://biothingsapi.readthedocs.io/en/latest/

Page 24: High-performance web services for gene and variant annotations

BioThings SDK

g.biothings.io: gene (alias to MyGene.info)

v.biothings.io: variant (alias to MyVariant.info)

s.biothings.io: species/taxonomy

Page 25: High-performance web services for gene and variant annotations

BioThings API for species/taxonomy

{
  "_id": "9606",
  "_version": 1,
  "authority": ["homo sapiens linnaeus, 1758"],
  "children": [63221, 741158],
  "common_name": "man",
  "genbank_common_name": "human",
  "has_gene": true,
  "lineage": [9606, 9605, 207598, …, 131567, 1],
  "parent_taxid": 9605,
  "rank": "species",
  "scientific_name": "homo sapiens",
  "taxid": 9606,
  "uniprot_name": "homo sapiens"
}

http://s.biothings.io/v1/species/9606?include_children=true

Page 26: High-performance web services for gene and variant annotations

BioThings API for species/taxonomy

{ "hits": [ { "_id": "1239", "_score": 10.971453, "common_name": […], "genbank_common_name": "gram-positive bacteria", "has_gene": false, "lineage": [1239, 1783272, 2, 131567, 1], "parent_taxid": 1783272, "rank": "phylum", "scientific_name": "firmicutes", "taxid": 1239, "uniprot_name": "firmicutes" } ], "max_score": 10.971453, "took": 12, "total": 1}

http://s.biothings.io/v1/query?q=rank:phylum AND common_name:gram-positive
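
The query URL above contains spaces and colons, so a client would normally percent-encode it before sending. The sketch below builds the encoded URL with the standard library and reads the taxonomy path out of the hit shown on this slide. It does not contact the service; `species_query_url` and `ancestors` are hypothetical helper names, and the hit is abridged to the fields used here.

```python
from urllib.parse import urlencode

SPECIES_API = "http://s.biothings.io/v1"


def species_query_url(query):
    """Percent-encode a query string for the /query endpoint."""
    return f"{SPECIES_API}/query?{urlencode({'q': query})}"


# The single hit from the slide, reduced to the fields used below.
hit = {
    "_id": "1239",
    "has_gene": False,
    "lineage": [1239, 1783272, 2, 131567, 1],
    "parent_taxid": 1783272,
    "rank": "phylum",
    "scientific_name": "firmicutes",
    "taxid": 1239,
}


def ancestors(doc):
    """In the slide's examples, lineage starts at the taxon itself and ends at the root."""
    return doc["lineage"][1:]


print(species_query_url("rank:phylum AND common_name:gram-positive"))
print(ancestors(hit))
```

In this hit the first ancestor equals `parent_taxid`, and `has_gene` is false, which matters for the gene query on the next slide.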

Page 27: High-performance web services for gene and variant annotations

Species API used in MyGene.info

You can now query for genes beyond species:

Q: Give me all lytic enzymes for any firmicutes

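Without the taxonomy tree (no gene is annotated directly to the phylum): 0 hits

http://mygene.info/v3/query?q=lytic enzyme&species=1239

With the taxonomy tree (all taxa under firmicutes are included): 5 hits

http://mygene.info/v3/query?q=lytic enzyme&species=1239&include_tax_tree=true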
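
One way to picture the effect of `include_tax_tree=true` is as an expansion of the requested taxid into all of its descendants before the species filter is applied. The sketch below illustrates that idea on a toy in-memory taxonomy. It is an assumption about the behavior, not the server's actual implementation, and every taxid other than 1239, as well as every gene record, is a made-up placeholder.

```python
# Toy taxonomy: parent taxid -> child taxids (placeholder ids under 1239).
CHILDREN = {
    1239: [900001, 900002],
    900001: [900003],
    900002: [],
    900003: [],
}

# Toy gene records; names and taxids are invented for illustration.
GENES = [
    {"symbol": "geneA", "name": "lytic enzyme", "taxid": 900003},
    {"symbol": "geneB", "name": "putative lytic enzyme", "taxid": 900002},
    {"symbol": "geneC", "name": "kinase", "taxid": 900001},
]


def expand_taxid(taxid, children):
    """Return the taxid plus every descendant, via depth-first traversal."""
    seen, stack = set(), [taxid]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(children.get(current, []))
    return seen


def search(genes, term, taxid, include_tax_tree=False):
    """Filter genes by name substring and by species (optionally the whole subtree)."""
    taxids = expand_taxid(taxid, CHILDREN) if include_tax_tree else {taxid}
    return [g for g in genes if term in g["name"] and g["taxid"] in taxids]


print(len(search(GENES, "lytic enzyme", 1239)))
print(len(search(GENES, "lytic enzyme", 1239, include_tax_tree=True)))
```

As on the slide, the query at the phylum level alone finds nothing, because genes are attached to taxa further down the tree.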

Page 28: High-performance web services for gene and variant annotations

Very minimal code for building a species API

Page 29: High-performance web services for gene and variant annotations

Have the flexibility to customize your query

Page 30: High-performance web services for gene and variant annotations

BioThings SDK

g.biothings.io: gene (alias to MyGene.info)

v.biothings.io: variant (alias to MyVariant.info)

s.biothings.io: species/taxonomy

c.biothings.io: drugs/compounds

d.biothings.io: disease

∙ ∙ ∙

Page 31: High-performance web services for gene and variant annotations

BioThings APIs

A collection of data APIs: data as a service

A framework for building new APIs: software as a service

Got a new type of “BioThings”? We can help you build or even host your BioThings API

Page 32: High-performance web services for gene and variant annotations

BioThings TEAM

Funding and Support: U01HG008473, U54GM114833

TSRI: Chunlei Wu, Andrew Su, Jiwen Xin, Cyrus Afrasiabi, Sebastien Lelong, Ginger Tsueng, Julee Adesara, Mike Mayers

U. Washington: Sean Mooney, Moritz Juchler, Nikhil Gopal

Page 33: High-performance web services for gene and variant annotations

Source code

• MyGene.info: https://github.com/sulab/mygene.info

• MyVariant.info: https://github.com/sulab/myvariant.info

• BioThings API for species/taxonomy: https://github.com/sulab/biothings.species

• BioThings SDK: https://github.com/sulab/biothings.api

Page 34: High-performance web services for gene and variant annotations

DEMO time!

by Jiwen (Kevin) Xin

Page 35: High-performance web services for gene and variant annotations

Filtering steps to prioritize candidate genes (number of genes remaining after each step):

Initial number of genes mutated in all four patients: 2441

nVars <- countGenes(vars)

Filtering for sequencing coverage and strand bias: 2308

filter1 <- lapply(vars, function(i) subset(i, DP > 8 & FS < 30 & QD > 2))

Filtering for nonsynonymous and splice site variants: 1917

filter2 <- lapply(filter1, function(i) subset(i, cadd.consequence %in% c("NON_SYNONYMOUS", "STOP_GAINED", "STOP_LOST", "CANONICAL_SPLICE", "SPLICE_SITE")))

Filtering for rare variants based on allele frequencies from ExAC: 18

filter3 <- lapply(filter2, function(i) subset(i, exac.af < 0.01))

Filtering for rare variants based on allele frequencies from 1000 Genomes Project: 9

filter4 <- lapply(filter3, function(i) subset(i, sapply(dbnsfp.1000gp1.af, function(j) j < 0.01 )))

Filtering by GO biological process annotation using MyGene.info: 5

goBP <- data.frame(queryMany(top.genes$Var1, scopes="symbol", species="human", fields=c("go.BP", "name", "MIM", "uniprot")))

# The Bioconductor package GO.db is used to find all genes with a GO biological process annotation that
# is a descendant of GO:0008152 - the GO id for metabolic process.
miller.bp <- lapply(goBP$go.BP, function(i) unlist(i$id))
bp.ancestor <- lapply(miller.bp, function(i) sapply(i, function(j) "GO:0008152" %in% unlist(GOBPANCESTOR[[j]])))
candidate.genes <- top.genes$Var1[sapply(bp.ancestor, function(i) TRUE %in% i)]
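
The same filter cascade can be sketched in Python to show how the gene count shrinks step by step. This mirrors the thresholds in the R snippets on this slide but runs on a handful of invented variant records. The flat field names (`exac_af`, `kg_af`, `consequence`) and the `run_cascade` helper are assumptions of this sketch, not the layout returned by MyVariant.info, and the GO-based step is omitted.

```python
KEEP_CONSEQUENCES = {
    "NON_SYNONYMOUS", "STOP_GAINED", "STOP_LOST",
    "CANONICAL_SPLICE", "SPLICE_SITE",
}

# Invented records: each one is designed to drop out at a different step.
variants = [
    {"gene": "A", "DP": 20, "FS": 5, "QD": 10, "consequence": "NON_SYNONYMOUS", "exac_af": 0.001, "kg_af": 0.002},
    {"gene": "B", "DP": 5, "FS": 5, "QD": 10, "consequence": "NON_SYNONYMOUS", "exac_af": 0.001, "kg_af": 0.002},
    {"gene": "C", "DP": 30, "FS": 2, "QD": 12, "consequence": "SYNONYMOUS", "exac_af": 0.001, "kg_af": 0.002},
    {"gene": "D", "DP": 30, "FS": 2, "QD": 12, "consequence": "STOP_GAINED", "exac_af": 0.2, "kg_af": 0.002},
    {"gene": "E", "DP": 30, "FS": 2, "QD": 12, "consequence": "SPLICE_SITE", "exac_af": 0.005, "kg_af": 0.05},
]

# Same thresholds as filter1..filter4 in the R code.
STEPS = [
    ("coverage and strand bias", lambda v: v["DP"] > 8 and v["FS"] < 30 and v["QD"] > 2),
    ("nonsynonymous or splice site", lambda v: v["consequence"] in KEEP_CONSEQUENCES),
    ("rare in ExAC", lambda v: v["exac_af"] < 0.01),
    ("rare in 1000 Genomes", lambda v: v["kg_af"] < 0.01),
]


def run_cascade(records, steps):
    """Apply each filter in order and record how many distinct genes remain."""
    counts = [("initial", len({r["gene"] for r in records}))]
    current = records
    for label, keep in steps:
        current = [r for r in current if keep(r)]
        counts.append((label, len({r["gene"] for r in current})))
    return current, counts


survivors, counts = run_cascade(variants, STEPS)
for label, n_genes in counts:
    print(f"{label}: {n_genes} genes")
```

Real annotation data often has missing allele frequencies; a production version would need to decide whether a missing value counts as rare, which this sketch does not address.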

Page 36: High-performance web services for gene and variant annotations

Demos in Jupyter notebooks

• Using myvariant and mygene in R for variant prioritization: http://nbviewer.jupyter.org/github/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_R_miller.ipynb

• Access ClinVar data from myvariant in Python: http://nbviewer.jupyter.org/github/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_clinvar_demo.ipynb

• ID mapping using mygene module in Python: http://nbviewer.jupyter.org/gist/newgene/6771106