Platforms CIBERER and INB-ELIXIR-es

Joaquín Dopazo

Computational Genomics Department,

Centro de Investigación Príncipe Felipe (CIPF),

Functional Genomics Node (INB-ELIXIR-es),

Bioinformatics in Rare Diseases (BiER-CIBERER),

Valencia, Spain.


http://bioinfo.cipf.es http://www.babelomics.org @xdopazo

Symposium: International platforms for biomedical research: A focus on rare diseases,

Fundación Ramón Areces, Madrid, 3-4 November 2016

The CIBERER “1000 genomes” Initiative to sequence rare disease patients

Diseases with:
• Unknown genes
• No mutations in known genes

Search for:
• New genes
• Known genes with unknown modifier genes
• Susceptibility genes

http://www.gbpa.es/

Sample providers Sequencing platforms Data analysis

A total of 1044 individuals (including 300 controls), covering more than 30 diseases, were sequenced between 2012 and 2013.

The actors: MGP and CIBERER

MGP is a public-private partnership (PPP) between the Andalusian regional government and Roche. The MGP roadmap is based on the availability of:
• More than 14,000 clinically well-characterized samples
• An automatically updated PATIENT HEALTH RECORD (PHR)
• SAMPLE INFORMATION (SI)
These will be used as the first steps towards the implementation of genomic and personalized medicine in the Andalusian HEALTHCARE SYSTEM, a system covering a population of 8.5 million. MGP spans from 2012 to 2014.

The Spanish Network for Research in Rare Diseases (CIBERER) is an initiative of the Spanish Health Ministry. The CIBERER is composed of 60 research and clinical groups distributed across the country and has been running since 2005.

MAX, Pheochromocytoma; NFU1, Mitochondrial disease; GlialCAM, MLC

The results: gene discovery at CIBERER

OTOG, Deafness; PLOD2, Osteogenesis; COQ4, CoQ10; BMP1, Osteogenesis

2011 2012

PHOX2B, Hirschsprung; SERAC1, Aciduria; ERCC4, Fanconi anemia; PPM1K, MSUD; TNPO3, Muscular dystrophy; CFHR1, DDD; SERPINF1, LEPRE1, CRTAP, PPIB, Osteogenesis; WNT1, Osteogenesis

2013

DNMT3B, Hirschsprung; YWHAZ, DRP2, Retinitis pigmentosa

RD3, Retinitis pigmentosa; TUFM, IL27, Chromosomal rearrangements

LIPT1, Lipoylation defects; BMP1, Osteogenesis

IFITM5, Osteogenesis; RNF125, Overgrowth

2014

ZNF408, Retinal dystrophy; ATP4A, Carcinoid tumor; MDH2, Pheochromocytoma; Junctophilin-1, CMT; EGR2, CMT; JMJD1C, Rett syndrome; POT1, Cardiac angiosarcoma; FAN1, Hereditary colorectal cancer; ALDH18A1, Hereditary paraplegia; MORC2, CMT; ZNF408, Retinitis pigmentosa AR; KITLG, Waardenburg Syndrome Type 2; CAV1, Neonatal lipodystrophy syndrome; IL8, IL13, Renal cell carcinoma

2015

ATP4A, Gastric tumor; CCNF, ALS

2016

Sequencing initiative

ACCI projects

Data management, analysis and

storage = knowledge increase

http://www.gbpa.es/

[Figure: sequencing data workflow. Labels: Samples; Raw files (FastQ); Analysis pipeline; Processed files (BAM, VCF); Storage; DB; K-DB; Prioritization report; Diagnostic portfolio; Dialog with experts in the disease + validations]

An example with 3-Methylglutaconic aciduria syndrome

3-Methylglutaconic aciduria (3-MGA-uria) is a heterogeneous group of syndromes characterized by an increased excretion of 3-methylglutaconic and 3-methylglutaric acids.

WES with a consecutive filter approach is enough to detect the new mutation in this case.

The prioritization process is actually a heuristic filtering strategy that reduces the immense list of candidate variants.

Prioritization programs: making the prioritization report interactive

Numerous interactive filters to discard unlikely candidate variants:
- Mutational impact
- Population frequency
- Family segregation
- Inheritance mode
- Consequence type
- Functional considerations (GO, HPO, etc.)
- Etc.

Different views, including the genomics perspective with GenomeMaps
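The consecutive filters listed above can be pictured as a chain of predicates applied one after another to a variant list. The sketch below is illustrative only: the record fields (`consequence`, `maf`, `genotype`), the thresholds, the gene names and the sample data are all assumptions made for this example, not the fields or defaults of any prioritization program named in these slides.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    consequence: str  # e.g. "missense", "synonymous", "stop_gained"
    maf: float        # population frequency of the alternate allele
    genotype: str     # genotype in the affected individual: "0/1" or "1/1"


# Hypothetical set of consequence types treated as potentially damaging.
DAMAGING = {"missense", "stop_gained", "frameshift", "splice_site"}


def by_consequence(v):
    return v.consequence in DAMAGING


def by_frequency(v, max_maf=0.01):
    # Assumed threshold: discard variants that are common in the population.
    return v.maf <= max_maf


def by_recessive_inheritance(v):
    # Recessive model: keep homozygous alternate genotypes only.
    return v.genotype == "1/1"


def prioritize(variants, filters):
    """Apply the filters consecutively; record how many variants survive each step."""
    trace = [len(variants)]
    for keep in filters:
        variants = [v for v in variants if keep(v)]
        trace.append(len(variants))
    return variants, trace


# Made-up variants, for illustration only.
variants = [
    Variant("GENE_A", "synonymous", 0.20, "0/1"),
    Variant("GENE_B", "missense", 0.15, "1/1"),
    Variant("GENE_C", "missense", 0.001, "0/1"),
    Variant("GENE_D", "stop_gained", 0.0005, "1/1"),
]
candidates, trace = prioritize(
    variants, [by_consequence, by_frequency, by_recessive_inheritance]
)
print(trace, [v.gene for v in candidates])
```

Keeping the per-step counts makes it easy to see which filter removes most candidates, which is the kind of feedback an interactive report gives.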

On the importance of the local variability in the prioritization process

And… on how to use local variability without compromising the confidential nature of genomic data

The CIBERER Spanish Variant Server (CSVS): the first repository of variability of the Spanish population

Only one other similar initiative exists: the GoNL (http://www.nlgenome.nl/). More recently, similar resources have appeared for the Finnish and the Icelandic populations.

http://ciberer.es/bier/exome-server/

The CSVS is a crowdsourcing project

Scenario: Sequencing projects of healthy populations are expensive and funding bodies are reluctant to fund them.

CSVS aim: To offer increasingly accurate information on variant frequencies characteristic of the Spanish population.

CSVS main use: Frequency-based filtering of candidate variants.

Main data source: Sequencing projects of individual researchers (CIBERER and others).

Problem: Most of the contributions correspond to patient exomes.

Idea: Patients of disease A can be considered healthy pseudo-controls for disease B (provided no common genetic background exists between A and B).

Beacon: CSVS will soon appear in the Beacon server.

http://ciberer.es/bier/exome-server/

The CSVS Interface

CSVS is organized in disease categories

CSVS can be queried by chromosomal region or gene

Why bin data into ICD-10 categories?

The first level of ICD-10 disease categories offers two advantages:
• No (or very low) common genetic background among ICD categories
• Classes big enough to preserve data confidentiality. Attempts to identify individuals within them will produce very vague phenotype clues

Binning into ICD-10 high-level categories is endorsed by CIBERER experts in bioethics.

D1 D2 D3 D4 D5 D6 D7 …… D22

(pseudo) controls for D3
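The pseudo-control idea amounts to pooling genotype counts over every disease category except the one under study. The following is a minimal sketch under that reading; the category labels, the per-category counts and the function name `pseudo_control_counts` are made up for illustration and do not describe how CSVS stores or serves its data.

```python
from collections import Counter

# Hypothetical genotype counts for a single variant, binned by disease category.
counts_by_category = {
    "D1": Counter({"0/0": 40, "0/1": 2, "1/1": 0}),
    "D2": Counter({"0/0": 30, "0/1": 1, "1/1": 0}),
    "D3": Counter({"0/0": 10, "0/1": 6, "1/1": 4}),
}


def pseudo_control_counts(counts, studied_category):
    """Sum genotype counts over all categories except the one under study."""
    total = Counter()
    for category, genotype_counts in counts.items():
        if category != studied_category:
            total.update(genotype_counts)  # Counter.update adds counts
    return total


# Individuals from D1 and D2 act as (pseudo) controls for a D3 patient.
controls_for_d3 = pseudo_control_counts(counts_by_category, "D3")
print(dict(controls_for_d3))
```

In this toy data the variant is frequent within D3 but rare in the pooled pseudo-controls, which is the situation in which a frequency-based filter would retain it as a candidate.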

Statistics

As of 11/09/2016, CSVS contains 790 exomes from unrelated Spanish individuals. About 1000 are expected by the end of the year.

Information provided

• Genotype frequencies in the different reference populations
• Genomic coordinates, variation, gene
• SNP id, if any
• Pathogenicity indexes
• Phenotype, if available

Variants can also be seen within their genomic context

GenomeMaps viewer (Medina et al., 2013, NAR) embedded in the application.

GenomeMaps is the official genome viewer of the ICGC (http://dcc.icgc.org/)

CSVS provides insights on the portion of the variability already contained in it

Table of Spanish Frequencies (TSF)

DB of Spanish variants (DBSV)

Chr  Position  Ref  Alt  0/0  0/1  1/1
1    1365313   A    T    75   0    0
1    1484884   G    A    70   4    1
2    326252    T    C    25   35   15
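Genotype counts like those in the table can be turned into an alternate-allele frequency with the standard formula (heterozygotes + 2 × alternate homozygotes) / (2 × individuals). The sketch below applies it to the three rows shown; the function name and the tuple layout are choices made for this example, not part of CSVS.

```python
def alt_allele_frequency(hom_ref, het, hom_alt):
    """Alternate-allele frequency from diploid genotype counts (0/0, 0/1, 1/1)."""
    individuals = hom_ref + het + hom_alt
    if individuals == 0:
        raise ValueError("no genotyped individuals")
    # Each individual carries two alleles; heterozygotes contribute one
    # alternate allele and alternate homozygotes contribute two.
    return (het + 2 * hom_alt) / (2 * individuals)


# Rows copied from the table: chr, position, ref, alt, 0/0, 0/1, 1/1.
rows = [
    ("1", 1365313, "A", "T", 75, 0, 0),
    ("1", 1484884, "G", "A", 70, 4, 1),
    ("2", 326252, "T", "C", 25, 35, 15),
]

freqs = {
    (chrom, pos): alt_allele_frequency(hom_ref, het, hom_alt)
    for chrom, pos, _ref, _alt, hom_ref, het, hom_alt in rows
}
for key, freq in freqs.items():
    print(key, round(freq, 4))
```

A frequency-based filter then reduces to comparing these values against a chosen threshold.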

[Figure: CSVS input flowchart. Labels: CSVS input; External and Internal VCFs; Unrelated? (DBSV); Spanish? (TSF); YES / NO; Counts; CES use; Other countries; Regional; SIP]

AIMs (ancestry-informative markers) are used to discard kinship and different ethnicity.

Diagnosis + biomarker discovery: an ongoing integrated CIBERER initiative

Ongoing CIBERER pilot project with the collaboration of seven hospitals: La Paz, FJD, Ramón y Cajal, CBM (Madrid), Virgen del Rocío (Sevilla), Hospital del Mar (Barcelona), HU La Fe (Valencia)

http://team.babelomics.org

http://BiERapp.babelomics.org

• Diagnosis using NGS and virtual panels
• Diagnostic SNV
• Management of variants of unknown significance (VUS) and unexpected findings
• Medical reports
• Generation and management of virtual panels
• 100% traceability of data management and decisions

The CIBERER CNV server

Stores CNVs found in patients of different hospitals, along with some interesting information on ethnicity, location, phenotype (HPO), etc., that can be studied in the genomic context (using GenomeMaps).

If everything goes as planned, it will contain data on more than 15,000 patients from 5 CIBERER hospitals by the end of the year.

What is inside? OpenCGA overview and goals

Open-source Computational Genomics Analysis (OpenCGA) aims to provide a high-performance and scalable solution for genomic big data processing and analysis.

OpenCGA is built on OpenCB: CellBase, Genome Maps, Cell Maps, HPG Aligner, HPG BigData, Variant annotation. Project at GitHub: https://github.com/opencb/opencga


OpenCGA - Storage

Times on a 6-node Hadoop cluster:
• Transform: 97 min
• Load: 80 sec
• Merge: 84 sec
• Annotate: 2000 variants/sec
• Millisecond response times for regional queries
• Whole genome filtering queries for all individuals within seconds

Extensive capabilities to query across genotype and phenotype relationships



Tools developed to improve the pipeline: CellBase, the knowledge DB

Now at: https://github.com/opencb/cellbase

Project: http://bioinfo.cipf.es/compbio/cellbase

CellBase (Bleda, 2012, NAR) is a comprehensive integrative database and RESTful Web Services API with more than 250 GB of data:
● Core features: genes, transcripts, exons, cytobands, proteins (UniProt), ...
● Variation: dbSNP and Ensembl SNPs, HapMap, 1000 Genomes, EVS, ExAC, etc.
● Pathogenicity indexes and conservation: SIFT, PolyPhen, CADD, PhastCons, phyloP, GERP, etc.
● Disease: ClinVar, OMIM, HGMD, COSMIC, etc.
● Functional: 40 OBO ontologies (Gene Ontology, HPO, etc.), InterPro, etc.
● Regulatory: TFBS, miRNA targets, conserved regions, etc.
● Systems biology: Interactome (IntAct), Reactome database, co-expressed genes.
● Compared in testing against VEP: more than 99.999% similarity in consequence types
● Annotation tool of GEL
● More than 10000 genomes annotated so far

http://docs.bioinfo.cipf.es/projects/cellbase/wiki

Tools developed to improve the pipeline: Genome Maps, the genome viewer

o Genome-scale data visualization plays an important role in the data analysis process. It is a big data management problem.
o Features of Genome Maps (Medina, 2013, NAR; ICGC data analysis portal)
● First 100% HTML5 web-based viewer: HTML5+SVG (inspired by Google Maps)
● Always updated, no browser plugins or installation
● Data taken from CellBase, remote NGS data, local files and DAS servers: genes, transcripts, exons, SNPs, TFBS, miRNA targets, etc.
● Other features: multi-species, API-oriented, easy integration, plugin framework, etc.

[Figure: BAM viewer, VCF viewer and ICGC genomic viewer]

www.genomemaps.org

Currently GM is being implemented in RD-Connect

What is next? The transition to precision medicine

Although we are already implementing genomic biomarkers, we are still in the empirical medicine era. Without knowledge of the functional relationship between genotype and disease we only have (increasingly better) probabilistic associations.

[Figure: degree of personalization, from today to tomorrow. Stages: Intuitive Medicine, Empirical Medicine, Precision Medicine. Labels: intuitive; based on trial and error; identification of probabilistic patterns; decisions and actions based on knowledge; genomic biomarkers; molecular biomarkers]

We think “gene-centric”

http://www.fda.gov/drugs/scienceresearch/researchareas/pharmacogenetics/ucm083378.htm

• Thinking in terms of a unique causative gene is still reasonable for a number of rare diseases, but not for all of them.
• Current GWAS, NGS and gene expression analyses are eminently gene-centric.

As a consequence, most existing diagnostics and personalized treatments are based on single-gene biomarkers.

Biomedical data analysis platforms need to go beyond supporting gene-centric pipelines / algorithms / procedures and evolve towards a systems biology based perspective

Genetic diseases have a modular nature and, consequently, must be addressed from a systems biology perspective

• With the development of systems biology, studies have shown that phenotypically similar diseases are often caused by functionally related genes, which is referred to as the modular nature of human genetic diseases (Oti and Brunner, 2007; Oti et al, 2008).
• This modularity suggests that causative genes for the same or phenotypically similar diseases may generally reside in the same functional module, either a protein complex, a sub-network of protein interactions, or a pathway.
• Perturbed modules account for disease better than individual perturbed genes.

Disease genes are close in the interactome

Goh 2007 PNAS

The same disease in different populations is caused by different genes affecting the same functions (Fernandez, 2013, Orphanet J Rare Dis.)

In fact, predictions made with proper models of functional modules outperform predictions based on their individual components

The activity of the pathway is better correlated with survival than individual gene activities (Fey et al., Sci. Signal., 2015).

ODEs are used to solve the dynamics of a model from the expression values of its components.

Problem: ODEs can efficiently solve only small systems.

Two problems: defining functional modules and modeling their behavior

• Gene ontology: descriptive; unstructured functional labels
• Networks of interaction, regulation, etc.: relationships among components but unknown function
• Pathways: relationships among components and their functional roles

Models

Enrichment methods. GO, etc. (simple statistical tests)

Connectivity models. Protein-protein, protein-DNA and protein-small molecule interactions (tests on network properties)

Low resolution models. Models of signalling pathways, metabolic pathways, regulatory pathways, etc. (executable models)

Detailed models. Kinetic models including stoichiometry, balancing reactions, etc. (mathematical models)
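As an illustration of the first model type above (enrichment methods based on simple statistical tests), a one-sided hypergeometric test asks whether a functional label appears in a gene list more often than expected by chance. This is a generic textbook sketch rather than the procedure of any specific tool; the function name, parameter names and example numbers are assumptions made for this example.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) for a hypergeometric draw.

    population: genes in the background
    annotated:  background genes carrying the functional label
    sample:     genes in the list being tested
    hits:       genes in the list carrying the label
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


# Made-up numbers: 100 selected genes out of 20000, a label annotating 200
# background genes, 8 of which fall in the selected list.
p_value = hypergeom_enrichment_p(20000, 200, 100, 8)
print(p_value)
```

Tests of this kind treat a module as an unstructured set of genes, which is why the slide ranks them as the simplest of the four model types.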

The behavior of a functional module can be estimated from the behavior of its components

Transforming gene expression levels into a different metric that accounts for a function. The easiest example of modeling function is signaling pathways, whose function is the transmission of a signal from a receptor to an effector.

[Figure: signaling circuit from receptors to effectors]

Important assumption: collective changes in gene expression within the context of a signaling circuit are proxies of changes in protein activation.

Important fact: when the signal reaches the end of a circuit, it triggers a function.
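To make the receptor-to-effector idea concrete, the sketch below propagates a signal through a small circuit using normalized gene expression as a proxy for protein activity. The propagation rule is an assumption chosen for illustration (a node passes on its own activity, multiplied by the probability that at least one activator signals and that no inhibitor does); the slides do not state the exact formula. The node names and activity values are also made up.

```python
def propagate(circuit, activity, receptor_signal=1.0):
    """Propagate a signal through a circuit given in topological order.

    circuit:  list of (node, activators, inhibitors) tuples
    activity: node -> expression-derived activity, scaled to [0, 1]
    Returns a dict node -> signal intensity leaving that node.
    """
    signal = {}
    for node, activators, inhibitors in circuit:
        if not activators and not inhibitors:
            # Entry node (e.g. a receptor): receives the external signal.
            incoming = receptor_signal
        else:
            incoming = 1.0
            if activators:
                none_active = 1.0
                for a in activators:
                    none_active *= 1.0 - signal[a]
                incoming = 1.0 - none_active
            for i in inhibitors:
                incoming *= 1.0 - signal[i]
        signal[node] = activity[node] * incoming
    return signal


# Hypothetical circuit: receptor R activates kinase K, which activates
# effector E; node I inhibits E.
circuit = [
    ("R", [], []),
    ("I", [], []),
    ("K", ["R"], []),
    ("E", ["K"], ["I"]),
]
activity = {"R": 0.8, "I": 0.5, "K": 0.5, "E": 1.0}
signal = propagate(circuit, activity)
print(round(signal["E"], 3))
```

The value at the effector is the quantity that would then be compared across samples or related to a clinical parameter.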

Signaling activity triggers cell functions directly related to cancer progression

Estimations of the signal intensity received by the effectors that trigger a cancer-related function can be related to clinical parameters, such as survival.

Actually, signaling activity triggers all the cancer hallmarks (Hanahan and Weinberg, 2011, Hallmarks of cancer: the next generation. Cell 144, 646).

Negative regulation of release of cytochrome c from mitochondria (inhibition of apoptosis)

Some interesting features of mechanistic biomarkers derived from models of pathway activity

• Mechanistic biomarkers show high specificity and sensitivity
• Models used for obtaining mechanistic biomarkers can integrate different omics data (e.g. mutations)
• Mechanistic biomarkers can be used in the context of prediction

[Figure: specificity and sensitivity]

Future prospects: Actionable models

The real advantage of models is that, in the same way that they can be used to convert omics data into measurements of cell functionality that provide information on disease mechanisms and drug MoA, they can be used to test hypotheses such as “what if I suppress (or over-express) this gene?” This leads to the concept of actionable models.

By simulating changes in gene expression/activity it is easy to carry out:
• Direct studies of the consequences of induced gene over-expressions or KOs
• Reverse studies of the genes that need to be perturbed to change cell functionalities, such as:
  • Reverting a cell to its “normal” functional status
  • Selectively killing diseased cells without affecting normal cells
  • Enhancing or reducing cell functionalities (e.g., apoptosis or proliferation, respectively, to fight cancer)
  • Etc.
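A "what if" experiment of this kind can be sketched by setting a gene's activity to zero and recomputing the signal that reaches the effector. The example below uses a toy circuit and an assumed propagation rule (node activity × probability that at least one activator signals × probability that no inhibitor signals); the circuit topology, the node names other than RAF1, and all values are hypothetical, and the code is not the implementation behind any tool named in these slides.

```python
from math import prod

# Hypothetical circuit: node -> (activators, inhibitors), in topological order.
CIRCUIT = {
    "RECEPTOR": ([], []),
    "RAF1": (["RECEPTOR"], []),
    "BRAKE": ([], []),
    "EFFECTOR": (["RAF1"], ["BRAKE"]),
}


def effector_signal(activity, circuit=CIRCUIT, effector="EFFECTOR"):
    """Signal reaching the effector under the assumed propagation rule."""
    signal = {}
    for node, (activators, inhibitors) in circuit.items():
        # Nodes without activators are treated as receiving a full input signal.
        activation = (
            1.0 - prod(1.0 - signal[a] for a in activators) if activators else 1.0
        )
        inhibition = prod(1.0 - signal[i] for i in inhibitors)
        signal[node] = activity[node] * activation * inhibition
    return signal[effector]


def knock_out(activity, gene):
    """Return a copy of the activity profile with one gene silenced."""
    return {**activity, gene: 0.0}


baseline = {"RECEPTOR": 0.9, "RAF1": 0.8, "BRAKE": 0.5, "EFFECTOR": 1.0}
before = effector_signal(baseline)

# Direct study: effect of each single-gene KO on the effector signal.
effects = {
    gene: effector_signal(knock_out(baseline, gene)) - before for gene in baseline
}

# Reverse study: which single KO increases the effector signal the most?
best_gene = max(effects, key=effects.get)
print(round(before, 3), best_gene)
```

The same loop extends to over-expression by raising an activity value instead of zeroing it, and the reverse study generalizes to searching over combinations of perturbations.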

Actionable pathway models

[Figure: actionable pathway model interface. Labels: KO in RAF1 gene; drugs that target RAF1; extra targets of the selected drugs; other pathways affected by the KO; specific circuits affected; action button]

http://pathact.babelomics.org/

Biomedical platforms need to evolve to provide real support for the transition to precision medicine

The use of new algorithms that enable the transformation of genomic measurements into cell functionality measurements that account for disease mechanisms and for drug mechanisms of action will ultimately allow the real transition from today’s empirical medicine to precision medicine and provide an increasingly personalized medicine.


The Computational Genomics Department at the Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain, and the INB-ELIXIR, National Institute of Bioinformatics, and the BiER (CIBERER Network of Centers for Research in Rare Diseases)

Follow us on Twitter: @xdopazo @bioinfocipf
