deep phenotyping to aid identification of coding & non-coding rare disease variants

Deep phenotyping to aid identification of coding & non-coding rare disease variants Melissa Haendel, PhD March 2017 @monarchinit @ontowonka [email protected]

Upload: mhaendel

Post on 05-Apr-2017


TRANSCRIPT

Page 1: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

Deep phenotyping to aid identification of coding & non-coding rare disease variants

Melissa Haendel, PhD

March 2017

@monarchinit @ontowonka [email protected]

Page 2: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

Acknowledgments

Charité: Max Schubach, Sebastian Köhler

Univ of Milan: Giorgio Valentini

RTI: Jim Balhoff

OHSU: Kent Shefchek, John Letaw, Julie McMurry, Nicole Vasilevsky, Matt Brush, Tom Conlin, Dan Keith

Genomics England/Queen Mary: Damian Smedley, Julius Jacobsen

Jackson Laboratory: Peter Robinson

Stanford: Shruti Marwaha, Matthew Wheeler, Euan Ashley

Lawrence Berkeley: Chris Mungall, Suzanna Lewis, Jeremy Nguyen, Seth Carbon

Garvan: Tudor Groza

https://monarchinitiative.org/page/team

Page 3: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

The genome is sequenced, but we still don't know very much about what it does

OMIM: 3,398 Mendelian diseases with no known genetic basis

ClinVar: at least 120,000* variants with no known pathogenicity

*This is more than twice what it was in 2016!

Page 4: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

Prevailing clinical genomic pipelines leverage only a tiny fraction of the available data

PATIENT EXOME/ GENOME

PATIENT CLINICAL PHENOTYPES

PUBLIC GENOMIC DATA

PUBLIC CLINICAL PHENOTYPE, DISEASE DATA

POSSIBLE DISEASES

DIAGNOSIS & TREATMENT

PATIENT ENVIRONMENT

PUBLIC ENVIRONMENT, DISEASE DATA

PATIENT OMICS PHENOTYPES

PUBLIC OMICS PHENOTYPES, CORRELATIONS

Under-utilized data

Page 5: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

The Human Phenotype Ontology

11,813 phenotype terms

127,125 rare disease-phenotype annotations

136,268 common disease-phenotype annotations

bit.ly/hpo-paper

Page 6: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

Adding other species’ data helps fill knowledge gaps in human genome

Page 7: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

More species = more coverage

19,008: number of human protein-coding genes in ExAC DB as per Lek et al. Nature 2016

9,739 (51%)

14,779 (78%)

2,195 + 7,544 + 7,235 = 16,974 (union of coverage in any species)

Even inclusion of just four species boosts phenotypic coverage of genes by 38 percentage points (51% → 89%). Combined = 89%

Mungall et al Nucleic Acids Research bit.ly/monarch-nar-2016
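The coverage figures on this slide can be checked with a few lines of arithmetic. The sketch below reads the three addends as human-only, shared, and other-species-only gene counts. That reading is an assumption inferred from the fact that the sums reproduce the slide's 9,739 and 14,779; the slide does not label them.

```python
# Gene counts taken from the slide; the three-way split labels are an
# assumption inferred from the sums, not stated in the source.
TOTAL_GENES = 19_008
human_only, both, other_only = 2_195, 7_544, 7_235


def pct(n):
    """Share of all protein-coding genes, rounded to a whole percent."""
    return round(100 * n / TOTAL_GENES)


human = human_only + both               # genes with human phenotype data
other = both + other_only               # genes with data in other species
union = human_only + both + other_only  # genes covered in any species

print(human, pct(human))
print(other, pct(other))
print(union, pct(union))
print("gain in percentage points:", pct(union) - pct(human))
```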

Page 8: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

Phenotypic profile matching

Page 9: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

Combining G2P data for variant prioritization

Whole exome

Remove off-target and common variants

Variant score from allele freq and pathogenicity

Phenotype score from phenotypic similarity

PHIVE score to give final candidates

Mendelian filters
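A minimal sketch of how the steps on this slide could fit together is given below. It is not Exomiser's implementation: the `Candidate` fields, the frequency weighting in `variant_score`, the 1% cutoff, and the plain average used for the combined score are all assumptions made for illustration, and the Mendelian inheritance filters are left out.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    gene: str
    allele_freq: float      # population allele frequency, 0..1
    pathogenicity: float    # predicted pathogenicity, 0..1
    phenotype_score: float  # phenotypic similarity to the patient, 0..1
    on_target: bool = True


def variant_score(c, max_af=0.01):
    """Hypothetical weighting: pathogenicity shrinks as the allele
    frequency approaches the cutoff."""
    return c.pathogenicity * (1.0 - c.allele_freq / max_af)


def prioritise(candidates, max_af=0.01):
    """Drop off-target and common variants, then rank by a combined score."""
    kept = [c for c in candidates if c.on_target and c.allele_freq < max_af]
    scored = [
        # Assumed combination rule: simple mean of the two scores.
        (0.5 * (variant_score(c, max_af) + c.phenotype_score), c.gene)
        for c in kept
    ]
    return sorted(scored, reverse=True)


# Hypothetical candidates, for illustration only.
ranked = prioritise([
    Candidate("GENE_A", 0.0, 1.0, 0.9),
    Candidate("GENE_B", 0.0, 1.0, 0.2),
    Candidate("GENE_C", 0.2, 1.0, 0.9),          # common variant
    Candidate("GENE_D", 0.0, 1.0, 0.9, False),   # off-target
])
```

With equal variant scores, the candidate whose gene matches the patient's phenotype better ranks first, which is the point the slide makes about combining genotype and phenotype evidence.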

Page 10: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

Exomiser results for UDP diagnosed patients

Inclusion of phenotype data improves variant prioritization

In 60% of first 1000 genomes at GEL, Exomiser predicts top candidate

In 86% of cases, Exomiser predicts within top 5

Page 11: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

Example case solved by Exomiser

[Figure: phenotypic profile and genes. Labels: heterozygous missense mutation; STIM-1; N/A; Stim1Sax/Sax]

Ranked STIM-1 variant maximally pathogenic based on cross-species G2P data, in the absence of traditional data sources

http://bit.ly/exomiser

Page 12: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

How to make sense of whole genomes

…when there are 3.5 Billion base pairs and so little is known about non-coding regions?

bit.ly/genomiser-2016

Page 13: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

1) Gather all evidence at each position (3.5B)

• Ancestral conservation
• GC content
• Max methylation, acetylation, trimethylation levels
• DNase hypersensitivity
• Enhancer attributes (robust, permissive)
• # overlapping transcription factor binding sites
• # rare variants (< 0.5% AF) +/- 500 nt
• # common variants (> 0.5% AF) +/- 500 nt
• Overlapping CNVs (ISCA, dbVar, DGV)
• (… 26 features in total)

bit.ly/genomiser-2016
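As an illustration of one of the per-position features listed above, the sketch below counts rare and common variants within 500 nt of a position. The input layout (a position-sorted list of `(position, allele_frequency)` pairs for one chromosome) and the function name `window_counts` are assumptions; only the 0.5% cutoff and the 500 nt flank come from the slide.

```python
from bisect import bisect_left, bisect_right


def window_counts(variants, position, flank=500, af_cutoff=0.005):
    """Count rare (< cutoff) and common (> cutoff) variants in
    [position - flank, position + flank].

    variants: list of (pos, allele_frequency), sorted by pos.
    """
    positions = [p for p, _ in variants]
    lo = bisect_left(positions, position - flank)
    hi = bisect_right(positions, position + flank)
    window = variants[lo:hi]
    rare = sum(1 for _, af in window if af < af_cutoff)
    common = sum(1 for _, af in window if af > af_cutoff)
    return rare, common


# Hypothetical variants on one chromosome.
example = [(100, 0.001), (400, 0.2), (900, 0.0001), (1600, 0.3)]
counts = window_counts(example, 500)
```

Rebuilding the `positions` list on every call keeps the sketch short; at genome scale it would be built once per chromosome.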

Page 14: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

2) Predict negative controls

> 5% prevalence

14.7 M putative non-deleterious positions

Highly conserved in ancestral genomes

bit.ly/genomiser-2016

Page 15: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

3) Hand-curate positives from literature

We curated 453 regulatory mutations judged as pathogenic by reported phenotypes (HPO) and other metrics

bit.ly/genomiser-2016

Page 16: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

4) Address positive-negative imbalance

14.7 M putative non-deleterious

453 known regulatory mutations

36,000 negative examples are available for every positive one

bit.ly/genomiser-2016

Page 17: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

Synthetically oversample positives & undersample negatives

14.7 M putative non-deleterious

453 known regulatory mutations

1) Partition negatives into 100 groups

2) Add to each negative group, all 453 known positives

3) In each group, oversample positives AND undersample negatives
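The three steps above can be sketched in plain Python. This is a simplified illustration, not the authors' code: the function names, the oversampling factor, and the negatives-per-positive ratio are assumptions, and the synthetic positives are made by interpolating between two random positives, whereas SMOTE proper interpolates toward a nearest neighbour.

```python
import random


def partition(items, n_groups, rng):
    """Shuffle items and deal them into n_groups roughly equal groups."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]


def oversample(positives, factor, rng):
    """Return the positives plus (factor - 1) * len(positives) synthetic
    points, each on the segment between two randomly chosen positives."""
    synthetic = []
    for _ in range((factor - 1) * len(positives)):
        a, b = rng.sample(positives, 2)
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return positives + synthetic


def undersample(negatives, size, rng):
    """Keep at most `size` randomly chosen negatives."""
    return rng.sample(negatives, min(size, len(negatives)))


def balanced_training_sets(positives, negatives, n_groups=100,
                           over_factor=2, neg_per_pos=3, seed=0):
    """Build one rebalanced (positives, negatives) pair per negative group."""
    rng = random.Random(seed)
    sets = []
    for group in partition(negatives, n_groups, rng):
        pos = oversample(positives, over_factor, rng)   # every group sees all positives
        neg = undersample(group, neg_per_pos * len(pos), rng)
        sets.append((pos, neg))
    return sets


# Toy feature vectors, for illustration only.
toy_pos = [(float(i), 1.0) for i in range(5)]
toy_neg = [(float(i), 0.0) for i in range(1000)]
toy_sets = balanced_training_sets(toy_pos, toy_neg, n_groups=10)
```

A natural next step, not stated on the slide, would be to train one classifier per rebalanced set and average their predictions.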

Page 18: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

Strongest predictors of deleterious mutation

• Higher DNase hypersensitivity
• Greater methylation
• Richer GC content
• Higher ratio of rare:common variation
• Higher conservation

bit.ly/genomiser-2016

Page 19: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

5) Benchmark using synthetic genomes

10,235 simulated disease genomes using 1000 Genomes data

Novel Regulatory Mendelian Mutation (ReMM) scoring method

Genomiser + ReMM outperforms other methods/tools across non-coding region types

bit.ly/genomiser-2016

Page 20: Deep phenotyping to aid identification  of coding & non-coding rare disease variants

www.monarchinitiative.org

Leadership: Melissa Haendel, Chris Mungall, Peter Robinson, Tudor Groza, Damian Smedley, Sebastian Köhler, Julie McMurry

Funding: NIH Office of Director: 2R24OD011883; NHGRI UDP: HHSN268201300036C, HHSN268201400093P; NCATS: UDN U01TR001395, Biomedical Data Translator: 1OT3TR002019; E-RARE 2015: Hipbi-RD 01GM1608