Joaquín Dopazo
Computational Genomics Department,
Centro de Investigación Príncipe Felipe (CIPF),
Functional Genomics Node, (INB-ELIXIR-es),
Bioinformatics in Rare Diseases (BiER-CIBERER),
Valencia, Spain.
Platforms CIBERER and
INB-ELIXIR-es
http://bioinfo.cipf.es http://www.babelomics.org @xdopazo
Symposium: International platforms for biomedical research: A focus on rare diseases,
Fundación Ramón Areces, Madrid, 3-4 November 2016
The CIBERER “1000 genomes” Initiative to sequence rare disease patients
Diseases with:
• Unknown genes
• No mutations in known genes
Search for:
• New genes
• Known genes with unknown modifier genes
• Susceptibility genes
http://www.gbpa.es/
Sample providers Sequencing platforms Data analysis
A total of 1044 patients
(including 300 controls) of
more than 30 diseases were
sequenced between 2012 and
2013.
The actors: MGP and CIBERER
MGP is a public-private partnership (PPP) between the Andalusian regional government and Roche. The MGP roadmap is based on the availability of:
• More than 14,000 clinically well-characterized samples
• An automatically updated PATIENT HEALTH RECORD (PHR)
• SAMPLE INFORMATION (SI)
These will be used as the first steps towards the implementation of genomic and personalized medicine in the Andalusian HEALTHCARE SYSTEM, a system covering a population of 8.5 million.
MGP spans from 2012 to 2014.
The Spanish Network for Research in Rare Diseases
(CIBERER) is an initiative of the Spanish Health Ministry.
The CIBERER is composed of 60 research and clinical
groups distributed across the country and has been
running since 2005.
The results: gene discovery at CIBERER
2011: MAX (pheochromocytoma); NFU1 (mitochondrial disease); GlialCAM (MLC)
2012: OTOG (deafness); PLOD2 (osteogenesis); COQ4 (CoQ10); BMP1 (osteogenesis)
2013: PHOX2B (Hirschsprung); SERAC1 (aciduria); ERCC4 (Fanconi anemia); PPM1K (MSUD); TNPO3 (muscular dystrophy); CFHR1 (DDD); SERPINF1, LEPRE1, CRTAP, PPIB (osteogenesis); WNT1 (osteogenesis)
2014: DNMT3B (Hirschsprung); YWHAZ, DRP2 (retinitis pigmentosa); RD3 (retinitis pigmentosa); TUFM, IL27 (chromosomal rearrangements); LIPT1 (lipoylation defects); BMP1 (osteogenesis); IFITM5 (osteogenesis); RNF125 (overgrowth)
2015: ZNF408 (retinal dystrophy); ATP4A (carcinoid tumor); MDH2 (pheochromocytoma); Junctophilin-1 (CMT); EGR2 (CMT); JMJD1C (Rett syndrome); POT1 (cardiac angiosarcoma); FAN1 (hereditary colorectal cancer); ALDH18A1 (hereditary paraplegia); MORC2 (CMT); ZNF408 (retinitis pigmentosa, AR); KITLG (Waardenburg syndrome type 2); CAV1 (neonatal lipodystrophy syndrome); IL8, IL13 (renal cell carcinoma)
2016: ATP4A (gastric tumor); CCNF (ALS)
Sequencing initiative
ACCI projects
Data management, analysis and
storage = knowledge increase
http://www.gbpa.es/
[Figure: data-flow diagram. Labels: Samples; Raw files (FastQ); Analysis pipeline; Processed files (BAM, VCF); Storage (DB); K-DB; Prioritization report (ranked list of candidate genes); Dialog with experts in the disease + validations; Diagnostic portfolio]
3-Methylglutaconic aciduria (3-MGA-uria) is a
heterogeneous group of syndromes
characterized by an increased excretion of 3-
methylglutaconic and 3-methylglutaric acids.
WES with a consecutive filter approach is
enough to detect the new mutation in this
case.
The prioritization process is actually a
heuristic filtering strategy that reduces
the immense list of candidate variants
An example with 3-Methylglutaconic aciduria syndrome
Prioritization programs: making the
prioritization report interactive
Numerous interactive filters to
discard unlikely candidate variants
- Mutational impact
- Population frequency
- Family segregation
- Inheritance mode
- Consequence type
- Functional considerations (GO,
HPO, etc.)
- Etc.
Different views, including the genomics perspective
with GenomeMaps
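The consecutive filtering idea listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Variant` fields, the 1% frequency cutoff, the set of damaging consequence types and the gene names are all hypothetical and are not taken from the prioritization programs shown in the slides.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str             # hypothetical gene label
    consequence: str      # e.g. "missense", "synonymous", "stop_gained"
    pop_freq: float       # alternate allele frequency in a reference population
    in_affected: bool     # carried by every affected relative
    in_unaffected: bool   # carried by at least one unaffected relative


# Each filter is a (label, predicate) pair; a variant survives when the predicate is True.
# Thresholds and categories are illustrative assumptions.
FILTERS = [
    ("consequence type", lambda v: v.consequence in {"missense", "stop_gained", "frameshift"}),
    ("population frequency", lambda v: v.pop_freq < 0.01),
    ("family segregation", lambda v: v.in_affected and not v.in_unaffected),
]


def prioritize(variants, filters=FILTERS):
    """Apply the filters one after another and record how many variants remain."""
    trace = []
    for label, keep in filters:
        variants = [v for v in variants if keep(v)]
        trace.append((label, len(variants)))
    return variants, trace


candidates = [
    Variant("GENE_A", "synonymous", 0.20, True, False),
    Variant("GENE_B", "missense", 0.15, True, False),
    Variant("GENE_C", "missense", 0.001, True, True),
    Variant("GENE_D", "stop_gained", 0.0005, True, False),
]
survivors, trace = prioritize(candidates)
for label, remaining in trace:
    print(f"after {label}: {remaining} candidate(s)")
```

Keeping the filters as an ordered list of predicates makes it easy to switch individual filters on and off, which is what an interactive report needs.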
On the importance of the local
variability in the prioritization process
And… on how to
use local
variability without
compromising the
confidential
nature of
genomic data
The CIBERER Spanish Variant Server (CSVS): the first repository of variability of the Spanish population
Only one other similar initiative
exists: the GoNL
http://www.nlgenome.nl/ http://ciberer.es/bier/exome-server/
And, more recently,
the Finnish
and the Icelandic
populations
The CSVS is a crowdsourcing project
Scenario: Sequencing projects of healthy
populations are expensive and funding
bodies are reluctant to fund them
CSVS Aim: To offer increasingly accurate
information on variant frequencies
characteristic of the Spanish population.
CSVS Main use: Frequency-based
filtering of candidate variants
Main data source: Sequencing projects
of individual researchers (CIBERER and
others)
Problem: Most of the contributions
correspond to patient exomes
Idea: Patients of disease A can be
considered healthy pseudo-controls for
disease B (provided that no common genetic
background exists between A and B)
Beacon: CSVS will soon appear in the
Beacon server
http://ciberer.es/bier/exome-server/
The CSVS Interface
CSVS is organized in disease categories
CSVS can be queried about chromosomal
regions or genes
Why bin data into ICD-10 categories?
The first level of the ICD-10 disease classification offers two
advantages:
• No (or very low) common genetic
background among ICD categories
• Classes big enough to preserve data
confidentiality. Attempts to identify
individuals within them will produce only very
vague phenotype clues
Binning into ICD-10 high-level categories is
endorsed by CIBERER experts in bioethics.
[Figure: disease categories D1 to D22; the categories other than D3 serve as (pseudo)controls for D3]
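The pseudo-control idea can be illustrated with a small Python sketch: for a query about one disease category, genotype counts are summed over all the other categories. The category labels follow the D1...D22 notation of the figure, but the counts and the function name are hypothetical, and this is not a description of how the CSVS server is implemented.

```python
def pseudo_control_counts(counts_by_category, excluded_category):
    """Sum (0/0, 0/1, 1/1) genotype counts over every category except the excluded one."""
    total = [0, 0, 0]
    for category, counts in counts_by_category.items():
        if category == excluded_category:
            continue  # patients of the queried disease are not controls for it
        for i, n in enumerate(counts):
            total[i] += n
    return tuple(total)


# Hypothetical genotype counts for a single variant, keyed by disease category.
counts = {
    "D1": (40, 2, 0),
    "D2": (55, 5, 1),
    "D3": (10, 20, 8),
    "D4": (30, 1, 0),
}
controls_for_d3 = pseudo_control_counts(counts, "D3")
print(controls_for_d3)
```

Because only per-category counts are stored, the query never needs access to individual genotypes.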
Statistics
As of 11/09/2016, CSVS contains the exomes of 790
unrelated Spanish individuals.
About 1000 are
expected by the
end of the year
Information provided
Genotype frequencies
in the different
reference populations
Genomic coordinates, variation, gene.
SNPid
if any
Information provided
Pathogenicity indexes
Phenotype,
if available
Variants can also be seen within their genomic context
GenomeMaps viewer (Medina et al., 2013, NAR) embedded in the application.
GenomeMaps is the official genome viewer of the ICGC (http://dcc.icgc.org/)
CSVS provides insights on the portion of the variability already contained in it
Table of Spanish Frequencies
(TSF)
DB of Spanish variants (DBSV)
Chr Position Ref Alt 0/0 0/1 1/1
1 1365313 A T 75 0 0
1 1484884 G A 70 4 1
2 326252 T C 25 35 15
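The genotype counts in the table above can be turned into an alternate allele frequency with simple arithmetic. The sketch below assumes diploid autosomal sites, so each individual contributes two alleles; the function name is illustrative.

```python
def alt_allele_frequency(hom_ref, het, hom_alt):
    """Alternate allele frequency from 0/0, 0/1 and 1/1 genotype counts (diploid sites)."""
    n_individuals = hom_ref + het + hom_alt
    if n_individuals == 0:
        raise ValueError("no genotypes observed at this site")
    # Heterozygotes carry one alternate allele, alternate homozygotes carry two.
    return (het + 2 * hom_alt) / (2 * n_individuals)


# Rows copied from the table: Chr, Position, Ref, Alt, 0/0, 0/1, 1/1
rows = [
    ("1", 1365313, "A", "T", 75, 0, 0),
    ("1", 1484884, "G", "A", 70, 4, 1),
    ("2", 326252, "T", "C", 25, 35, 15),
]
freqs = [alt_allele_frequency(*row[4:]) for row in rows]
for row, freq in zip(rows, freqs):
    print(f"chr{row[0]}:{row[1]} {row[2]}>{row[3]} alt frequency = {freq:.4f}")
```

For the second row, for instance, 4 heterozygotes and 1 alternate homozygote among 75 individuals give 6 alternate alleles out of 150.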
[Figure: CSVS input flowchart. Labels: CES use; Other countries; CSVS input; External; Internal; Regional; VCFs; Unrelated? (DBSV); Spanish? (TSF); YES / NO; Counts; SIP]
AIM (ancestry-informative
markers) are used to
discard kinship and
different ethnicity
Diagnosis + biomarker discovery: an ongoing
integrated CIBERER initiative
Ongoing CIBERER pilot project with the collaboration of seven hospitals: La
Paz, FJD, Ramón y Cajal, CBM (Madrid), Virgen del Rocío (Sevilla), Hospital del
Mar (Barcelona), HU La Fe (Valencia)
http://team.babelomics.org
http://BiERapp.babelomics.org
Diagnosis using NGS and
virtual panels
• Diagnostic SNV
• Variants of unknown significance (VUS) and unexpected findings management
• Medical reports
• Generation and management of virtual panels
• 100% traceability of data management and decisions
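A virtual panel can be thought of as a gene list applied in silico to variants that were already called from an exome. The following Python sketch illustrates that idea only; the variant records, identifiers and function name are hypothetical and do not describe the TEAM or BiERapp implementations. The gene symbols are taken from the gene discovery list earlier in the talk.

```python
def apply_virtual_panel(variants, panel_genes):
    """Split variants into those inside the gene panel and those outside it."""
    panel = {gene.upper() for gene in panel_genes}
    in_panel, off_panel = [], []
    for variant in variants:
        target = in_panel if variant["gene"].upper() in panel else off_panel
        target.append(variant)
    return in_panel, off_panel


# Hypothetical exome variants (identifiers are placeholders).
exome_variants = [
    {"id": "var_1", "gene": "BMP1"},
    {"id": "var_2", "gene": "CCNF"},
    {"id": "var_3", "gene": "WNT1"},
]
# Illustrative panel built from osteogenesis genes named earlier in the talk.
osteogenesis_panel = ["BMP1", "PLOD2", "WNT1", "IFITM5"]

in_panel, off_panel = apply_virtual_panel(exome_variants, osteogenesis_panel)
print("reported:", [v["id"] for v in in_panel])
print("outside the panel:", [v["id"] for v in off_panel])
```

Since the panel is only a filter over existing data, it can be redefined later without resequencing the sample.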
The CIBERER CNV server
Stores CNVs found in patients of different hospitals, along with some interesting information on ethnicity, location, phenotype (HPO), etc., that can be studied in the genomic context (using GenomeMaps).
If everything goes as planned, it will contain data on more than 15,000 patients from 5 CIBERER hospitals by the end of the year.
What is inside? OpenCGA: overview and goals
Open-source Computational Genomics Analysis (OpenCGA) aims to provide a high performance and scalable solution for genomic big data processing and analysis
OpenCGA is built on OpenCB: CellBase, Genome Maps, Cell Maps, HPG Aligner, HPG BigData, Variant annotation. Project at GitHub: https://github.com/opencb/opencga
Times:
• Transform: 97 min
• Load: 80 sec
• Merge: 84 sec
• Annotate: 2000 variants/sec
6 node Hadoop cluster:
• Transform: 97 min
• Load: 80 sec
• Merge: 84 sec
• Millisecond response
times for regional queries
• Whole genome filtering
queries for all individuals
within seconds
OpenCGA - Storage
Extensive capabilities to query across genotype and phenotype relationships
https://github.com/opencb/opencga
Tools developed to improve the pipeline: CellBase, the knowledge DB
Now at: https://github.com/opencb/cellbase
Project: http://bioinfo.cipf.es/compbio/cellbase
CellBase (Bleda, 2012, NAR), a comprehensive integrative database and RESTful Web Services API, with more than 250GB of data:
● Core features: genes, transcripts, exons, cytobands, proteins (UniProt),...
● Variation: dbSNP and Ensembl SNPs, HapMap, 1000 Genomes, EVS, ExAC, etc.
● Pathogenicity indexes and conservation: SIFT, PolyPhen, CADD, PhastCons, PhyloP, GERP, etc.
● Disease: ClinVar, OMIM, HGMV, COSMIC, etc.
● Functional: 40 OBO ontologies (Gene Ontology, HPO, etc.), InterPro, etc.
● Regulatory: TFBS, miRNA targets, conserved regions, etc.
● Systems biology: Interactome (IntAct), Reactome database, co-expressed genes.
● Compared in testing against VEP: more than 99.999% similarity in consequence types
● Annotation tool of GEL
● More than 10000 genomes annotated so far
Tools developed to improve the pipeline: Genome Maps, the genome viewer
o Genome-scale data visualization plays an important role in the data analysis process. It is a big data management problem.
o Features of Genome Maps (Medina, 2013, NAR; ICGC data analysis portal)
● First 100% HTML5 web-based viewer: HTML5+SVG (inspired by Google Maps)
● Always updated, no browser plugins or installation
● Data taken from CellBase, remote NGS data, local files and DAS servers: genes, transcripts, exons, SNPs, TFBS, miRNA targets, etc.
● Other features: multi-species, API-oriented, easy integration, plugin framework, etc.
BAM viewer
VCF viewer
ICGC genomic viewer
www.genomemaps.org
Currently GM is being implemented in RD-Connect
Although genomic biomarkers are already being implemented, we are still in the empirical
medicine era. Without knowledge of the functional relationship between genotype
and disease we only have (increasingly better) probabilistic associations.
What is next? The transition to precision medicine
[Figure: degree of personalization over time (today to tomorrow)]
Intuitive medicine: based on trial and error
Empirical medicine: identification of probabilistic patterns
Precision medicine: decisions and actions based on knowledge
Genomic biomarkers
Molecular biomarkers
We think “gene-centric”
http://www.fda.gov/drugs/scienceresearch/researchareas/pharmacogenetics/ucm083378.htm
• Thinking in terms of the unique
causative gene is still reasonable for a
number of rare diseases, but not for all
of them.
• Current GWAS, NGS and gene
expression analyses are eminently
gene-centric
As a consequence of this, most existing
diagnostic and personalized treatments are
based on single-gene biomarkers
Biomedical data analysis platforms need to go beyond
supporting gene-centric pipelines / algorithms / procedures
and evolve towards a perspective based on systems biology
Genetic diseases have a modular nature
and, consequently, must be addressed
from a systems biology perspective
• With the development of systems biology, studies have shown that phenotypically
similar diseases are often caused by functionally related genes, which is referred
to as the modular nature of human genetic diseases (Oti and Brunner, 2007; Oti
et al, 2008).
• This modularity suggests that causative genes for the same or phenotypically
similar diseases may generally reside in the same functional module, either a
protein complex, a sub-network of protein interactions, or a pathway
• Perturbed modules account for disease better than individual perturbed genes
Disease genes are close in the interactome
Goh 2007 PNAS
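One simple way to quantify "closeness" in an interaction network is the mean shortest-path length between pairs of genes. The Python sketch below uses a breadth-first search on a toy network; the protein names and edges are hypothetical, and the metric is an illustrative choice rather than the analysis performed in the cited work.

```python
from collections import deque
from itertools import combinations


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def mean_pairwise_distance(graph, genes):
    """Average shortest-path length over all connected pairs of the given genes."""
    distances = [shortest_path_length(graph, a, b) for a, b in combinations(genes, 2)]
    distances = [d for d in distances if d is not None]
    return sum(distances) / len(distances) if distances else None


# Toy undirected interactome: a linear chain P1 - P2 - ... - P6 (hypothetical proteins).
edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5"), ("P5", "P6")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

clustered = mean_pairwise_distance(graph, ["P1", "P2", "P3"])
scattered = mean_pairwise_distance(graph, ["P1", "P4", "P6"])
print(clustered, scattered)
```

A gene set belonging to the same functional module would be expected to behave like the clustered set, with a smaller mean distance than a scattered set.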
Same disease
in different
populations is
caused by
different genes
affecting the
same functions
Fernandez, 2013, Orphanet J Rare Dis.
In fact, predictions made with proper models of
functional modules outperform the predictions of
their components
The activity of the pathway is
better correlated with survival
than individual gene activities
Fey et al., Sci. Signal. (2015).
ODEs are used to solve the dynamics of a model
from the expression values of its
components
Problem:
ODEs can
efficiently solve
only small systems
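To make the ODE approach concrete, here is a minimal Python sketch of a two-step signaling cascade integrated with the explicit Euler method. The equations, rate constants and step size are illustrative assumptions and are not taken from the model of Fey et al.; the system is dA/dt = k1*S - d1*A and dB/dt = k2*A - d2*B.

```python
def simulate_cascade(stimulus, k1=1.0, d1=1.0, k2=1.0, d2=1.0, dt=0.01, steps=2000):
    """Euler integration of a two-step cascade: stimulus -> A -> B.

    All rate constants are hypothetical. Returns the values of A and B at the end.
    """
    a = b = 0.0
    for _ in range(steps):
        da = k1 * stimulus - d1 * a   # production by the stimulus, linear decay
        db = k2 * a - d2 * b          # production by A, linear decay
        a += dt * da
        b += dt * db
    return a, b


a_end, b_end = simulate_cascade(stimulus=1.0)
print(f"A = {a_end:.4f}, B = {b_end:.4f}")
```

Even this toy system needs four rate constants for two species. A full pathway would need one equation per component and many more parameters, which is one reason the approach scales poorly.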
Two problems: defining functional modules and modeling their behavior
Gene ontology: descriptive; unstructured functional labels
Networks of interaction, regulation, etc.: relationships among components but unknown function
Pathways: relationships among components and their functional roles
Models
Enrichment methods. GO, etc. (simple statistical tests)
Connectivity models. Protein-protein, protein-DNA and protein-small molecule interactions (tests on network properties)
Low resolution models. Models of signalling pathways, metabolic pathways, regulatory pathways, etc. (executable models)
Detailed models. Kinetic models including stoichiometry, balancing reactions, etc. (mathematical models)
The behavior of a functional module can be
estimated from the behavior of its
components
Transforming gene expression levels into a different metric that accounts for a function.
Easiest example of modeling function: signaling pathways.
Function: transmission of a signal from a receptor to an effector
Receptors Effectors
Important assumption:
collective changes in gene
expression within the
context of a signaling
circuit are proxies of
changes in protein
activation
Important fact: when the
signal reaches the end of a
circuit, it triggers a function
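The assumption above, that expression levels act as proxies for protein activation, can be turned into a signal-propagation sketch. The slides do not spell out a propagation rule, so the one used here is an assumption: each node passes on its own normalized expression value multiplied by the combined signal of its activators and damped by its inhibitors. Node names and values are hypothetical.

```python
def propagate(order, value, activators, inhibitors, input_signal=1.0):
    """Propagate a signal through a circuit whose nodes are given in topological order.

    Assumed rule: signal(n) = value(n) * (1 - prod(1 - s_a)) * prod(1 - s_i),
    where s_a are activator signals and s_i inhibitor signals. Nodes without
    activators are treated as entry points and receive input_signal.
    """
    signal = {}
    for node in order:
        acts = activators.get(node, [])
        if acts:
            not_activated = 1.0
            for a in acts:
                not_activated *= 1.0 - signal[a]
            incoming = 1.0 - not_activated
        else:
            incoming = input_signal
        for i in inhibitors.get(node, []):
            incoming *= 1.0 - signal[i]
        signal[node] = value[node] * incoming
    return signal


# Hypothetical circuit: receptor -> kinase -> effector, with an inhibitor acting on the effector.
order = ["receptor", "kinase", "inhibitor", "effector"]
value = {"receptor": 0.8, "kinase": 0.5, "inhibitor": 0.5, "effector": 1.0}
activators = {"kinase": ["receptor"], "effector": ["kinase"]}
inhibitors = {"effector": ["inhibitor"]}

signal = propagate(order, value, activators, inhibitors)
print(f"signal reaching the effector: {signal['effector']:.2f}")
```

The value reaching the effector is the quantity that would then be related to the cell function triggered at the end of the circuit.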
Signaling activity triggers cell functions
directly related to cancer progression
Estimations of signal intensity received by the effectors
that trigger a cancer-related function can be related to
clinical parameters, such as survival
Actually, signal activity triggers
all the cancer hallmarks
Hanahan, Weinberg, 2011
Hallmarks of cancer: the next
generation. Cell 144, 646
Negative regulation of release of cytochrome c
from mitochondria (inhibition of apoptosis)
Mechanistic biomarkers
show high specificity and
sensitivity
Models used for obtaining
mechanistic biomarkers
can integrate different
omics data (e.g. mutations)
Mechanistic biomarkers
can be used in the context
of prediction
Specificity Sensitivity
Some interesting features of mechanistic
biomarkers derived from models of pathway
activity
Future prospects: Actionable models
The real advantage of models is that, in the same way they can be used
to convert omics data into measurements of cell functionality that
provide information on disease mechanisms and drug MoA, they can
be used to test hypotheses such as “what if I suppress (or over-express)
this gene?” This leads to the concept of actionable models.
By simulating changes in gene expression/activity it is easy to perform:
• Direct studies of the consequences of induced gene over-expressions
or KOs
• Reverse studies of the genes that need to be perturbed to change cell
functionalities, such as:
• Restoring the “normal” functional status of a cell
• Selectively killing diseased cells without affecting normal cells
• Enhancing or reducing cell functionalities (e.g., apoptosis or
proliferation, respectively, to fight cancer)
• Etc.
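The "what if I suppress this gene?" question can be sketched as a direct study: set a gene's activity to zero, recompute the activity of each circuit, and report what changed. In this Python sketch the circuit activity is simply the product of the normalized expression values along the circuit, which is a deliberately crude assumption; the circuits, expression values and every gene name except RAF1 are hypothetical.

```python
from math import prod


def circuit_activity(circuits, expression):
    """Activity of each circuit, assumed here to be the product of its gene values."""
    return {name: prod(expression[g] for g in genes) for name, genes in circuits.items()}


def simulate_knockout(circuits, expression, gene):
    """Return the change in activity of every circuit after setting one gene to zero."""
    perturbed = {**expression, gene: 0.0}
    before = circuit_activity(circuits, expression)
    after = circuit_activity(circuits, perturbed)
    return {name: after[name] - before[name] for name in circuits}


# Hypothetical circuits sharing a receptor; only one of them contains RAF1.
circuits = {
    "circuit_1": ["RECEPTOR", "RAF1", "EFFECTOR_1"],
    "circuit_2": ["RECEPTOR", "GENE_X", "EFFECTOR_2"],
}
expression = {"RECEPTOR": 0.9, "RAF1": 0.8, "GENE_X": 0.5, "EFFECTOR_1": 1.0, "EFFECTOR_2": 1.0}

changes = simulate_knockout(circuits, expression, "RAF1")
affected = [name for name, delta in changes.items() if delta != 0]
print("circuits affected by the KO:", affected)
```

A reverse study would run the same simulation over many candidate genes and keep those whose perturbation moves a chosen circuit in the desired direction.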
Actionable pathway models
[Screenshot labels]
• KO in RAF1 gene
• Drugs that target RAF1
• Extra targets of the selected drugs
• Other pathways affected by the KO
• Specific circuits affected
• Action button
http://pathact.babelomics.org/
The use of new algorithms that enable the transformation of genomic
measurements into cell functionality measurements that account for
disease mechanisms and for drug mechanisms of action will ultimately
allow the real transition from today’s empirical medicine to precision
medicine and provide an increasingly personalized medicine
Biomedical platforms need to evolve to provide real support for the transition to
precision medicine
The Computational Genomics Department at the Centro de
Investigación Príncipe Felipe (CIPF), Valencia, Spain, and…
...the INB-ELIXIR, National Institute of Bioinformatics and the BiER (CIBERER Network of Centers for Research in Rare Diseases)
@xdopazo @bioinfocipf Follow us on twitter