MAPPIN: Method for Annotating, Predicting Pathogenicity, and mode of Inheritance for Nonsynonymous variants
Nehal Gosalia1,2, Aris N. Economides1,2, Frederick E. Dewey1, and Suganthi Balasubramanian1
1Regeneron Genetics Center, 2Regeneron Pharmaceuticals, Tarrytown, NY
BACKGROUND
• An average exome can contain 10,500-13,500 nonsynonymous single nucleotide variants (nsSNVs) [1,2], which is lower than expected, suggesting negative selection [3,4]
• A major challenge with whole exome sequencing (WES) is differentiating benign and disease-causing variants
• HGMD and OMIM® databases show that nsSNVs account for ~45% of disease-causing mutations [5,6], making it critical to identify them
• Many algorithms predict pathogenicity of nsSNVs; however, none of them can distinguish dominant from recessive disease-causing mutations [7]
• It is important to differentiate between heterozygous dominant-acting variants and heterozygous carrier variants
MAPPIN
• Method for Annotating, Predicting Pathogenicity, and mode of Inheritance for Nonsynonymous variants
• Training data sets:
  - Pathogenic variants from UniProt, ExoVar [7], subdivided using known dominant and recessive disease-causing genes from OMIM® [6] and others [8,9]
  - Haploinsufficient genes subset from dominant genes using haploinsufficiency predictions [10]
  - Benign variants from ClinVar [11] refined by a) criteria provided, multiple submitters, no conflict, b) reviewed by expert panel, and c) practice guideline (★★-★★★★)
[Figure 2 workflow diagram: Input (VCF file: chr, pos, ref, alt) → Variant Annotation (99 evolutionary, functional, network, and allele frequency features) → Prediction (Random Forest Classifier; training sets: ExoVar (pathogenic), ClinVar (benign)) → Output (annotation output; scores for Dominant, Recessive, and Benign classes)]
Category | Features
Evolutionary | GERP score [12], paralogs [13], pseudogenes [14] and other gene annotation metrics, dN/dS rates, average heterozygosity of nsSNVs and synonymous SNVs, nonsynonymous and synonymous SNP density
Functional | Transcript length, variant affecting all/some transcript isoforms, single exon gene, protein domain annotations [13], GTEx expression in individual tissues [15]
Network | Protein-protein interactions (BioGRID) [16], number of networks and interfaces, interactions with known dominant or recessive disease-causing genes (OMIM®) [6]
Allele Frequency | 1000G [17], ESP6500 [18], ExAC [19], pLI score by gene (measure of haploinsufficiency based on constraint, ExAC) [20]
Table 1. Features annotated within MAPPIN and used for predictions, subdivided into categories based on the type of annotation.
Figure 1. Adapted from Li et al., PLoS Genetics, 2013 [7]. Several prediction algorithms and combinations were tested on a dataset composed of known dominant and recessive disease-causing mutations. The figure shows that existing algorithms are unable to call dominant or recessive mutations confidently (AUCs ~0.55).
RESULTS
MAPPIN trained under two models:
i. Haploinsufficient model composed of genes causing dominant diseases through haploinsufficiency (Multiclass AUC = 0.96)
ii. All dominant model composed of all dominant disease-causing genes (Multiclass AUC = 0.91)
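The poster does not say how the multiclass AUC was computed. One common choice is to average one-vs-rest AUCs over the three classes; the sketch below illustrates that choice only, and the labels and scores in it are made-up toy values, not MAPPIN output.

```python
def binary_auc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def macro_ovr_auc(labels, class_scores):
    """Average the one-vs-rest AUC over every class present in labels.

    class_scores holds one dict per variant, mapping class name -> score.
    Macro one-vs-rest averaging is an assumption, not the poster's stated method.
    """
    aucs = []
    for cls in sorted(set(labels)):
        pos = [s[cls] for y, s in zip(labels, class_scores) if y == cls]
        neg = [s[cls] for y, s in zip(labels, class_scores) if y != cls]
        aucs.append(binary_auc(pos, neg))
    return sum(aucs) / len(aucs)


# Toy data for illustration only.
toy_labels = ["Benign", "Recessive", "Dominant", "Dominant"]
toy_scores = [
    {"Benign": 0.8, "Recessive": 0.1, "Dominant": 0.1},
    {"Benign": 0.2, "Recessive": 0.6, "Dominant": 0.2},
    {"Benign": 0.1, "Recessive": 0.2, "Dominant": 0.7},
    {"Benign": 0.3, "Recessive": 0.5, "Dominant": 0.2},
]
toy_auc = macro_ovr_auc(toy_labels, toy_scores)
```

Other definitions (for example, averaging over class pairs) can give different values on the same predictions, so the reported 0.96 and 0.91 should be read with the unstated definition in mind.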
FEATURE IMPORTANCE PLOT
PERFORMANCE OF FEATURE SUBSETS
VALIDATION ON TWO MENDELIAN DATASETS: CMG AND DDDS
Figure 2. Workflow for MAPPIN. User inputs a VCF file, which is annotated with 99 features and then run through a prediction model based on a random forest classifier trained on benign and pathogenic variants.
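As a rough illustration of the input step in the workflow, the sketch below pulls the four fields named in the figure (chr, pos, ref, alt) out of tab-separated VCF records. The names `Variant` and `read_variants` are hypothetical and are not part of MAPPIN; the example records are invented.

```python
from typing import Iterable, Iterator, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def read_variants(vcf_lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one (chr, pos, ref, alt) record per alternate allele."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information lines and the column header
        fields = line.split("\t")
        # VCF column order: CHROM, POS, ID, REF, ALT, ...
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # multi-allelic sites list ALT alleles comma-separated
            if alt != ".":  # "." means no alternate allele
                yield Variant(chrom, pos, ref, alt)


# Invented records for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG\t.\tPASS\t.",
    "2\t67890\t.\tC\tT,G\t.\tPASS\t.",
]
variants = list(read_variants(example))
```

Splitting multi-allelic sites into one record per alternate allele is a design choice of this sketch, so that each record corresponds to a single variant to annotate.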
DIFFERENTIATING BETWEEN DOMINANT & RECESSIVE DISEASE-CAUSING VARIANTS
DOMINANT AND RECESSIVE DISCRIMINATION FOR HGMD VARIANTS
Figure 6. Violin plots of score distributions for HGMD [5] variants in dominant and recessive genes. Training variants are excluded from the comparison, and variants were subset using genes from Berg et al. [24]. MAPPIN dominant (A) and recessive (B) class scores for HGMD variants in dominant and recessive genes. CADD (C) and Eigen (D) phred scores [23] for HGMD variants in dominant and recessive genes.
REFERENCES
1. Levy et al., The diploid genome sequence of an individual human. PLoS Biology, 2007
2. Ng et al., Genetic variation in an individual human exome. PLoS Genetics, 2008
3. Cargill, M et al., Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genetics, 1999
4. Stephens JC et al., Haplotype variation and linkage disequilibrium in 313 human genes. Science, 2001
5. Stenson et al., The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Human Genetics, 2014
6. Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD). World Wide Web URL: http://omim.org/
7. Li et al., Predicting Mendelian disease-causing non-synonymous single nucleotide variants in exome sequencing studies. PLoS Genetics, 2013
8. Blekhman, R. et al., Natural selection on genes that underlie human disease susceptibility. Current Biology, 2008
9. Boone, P.M. et al., Deletions of recessive disease genes: CNV contribution to carrier states and disease-causing alleles. Genome Research, 2013
10. Huang et al., Characterising and predicting haploinsufficiency in the human genome. PLoS Genetics, 2010
11. Landrum MJ et al., ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Research, 2015
12. Cooper, G.M. et al., Distribution and intensity of constraint in mammalian genomic sequence. Genome Research, 2005
13. Flicek et al., Ensembl 2014. Nucleic Acids Research, 2014
14. GENCODE, Pei, B. et al., The GENCODE pseudogene resource. Genome Biology, 2012
15. Lonsdale et al., The Genotype-Tissue Expression (GTEx) project. Nature Genetics, 2013
16. Stark, C. et al., BioGRID: a general repository for interaction datasets. Nucleic Acids Research, 2006 (version 3.4.128)
17. The 1000 Genomes Project Consortium, An integrated map of genetic variation from 1,092 human genomes. Nature, 2012
18. Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA (URL: http://evs.gs.washington.edu/EVS/), 2015
19. Lek et al., Analysis of protein-coding genetic variation in 60,706 humans. Nature, 2016
20. Samocha et al., A framework for the interpretation of de novo mutation in human disease. Nature Genetics, 2014
21. Chong et al., The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. American Journal of Human Genetics, 2015
22. Deciphering Developmental Disorders Study, Large-scale discovery of novel genetic causes of developmental disorders. Nature, 2015
23. Liu et al., dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs. Human Mutation, 2016
24. Berg et al., An informatics approach to analyzing the incidentalome. Genetics in Medicine, 2013
Figure 3. Precision and recall values for training data. A. Precision was calculated as true positives over all predicted positives (true positives plus false positives) for a 10-fold cross-validation. B. Recall was calculated as true positives over the sum of true positives and false negatives.
[Figure 3 plots: panel A (Precision) and panel B (Recall), y-axis 0.5 to 1.0, classes Benign, Recessive, and Dominant, shown for the Haploinsufficient Model and the All Dominant Model. Bar labels as extracted, in order: precision 0.88, 0.85, 0.87, 0.79, 0.80, 0.71; recall 0.94, 0.95, 0.87, 0.79, 0.74, 0.62.]
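The precision and recall definitions in the Figure 3 caption can be written out per class as in the sketch below. It covers only the per-class formulas, not the 10-fold cross-validation, and the labels are made-up toy data rather than MAPPIN predictions.

```python
from collections import Counter

CLASSES = ("Benign", "Recessive", "Dominant")


def precision_recall(true_labels, predicted_labels, classes=CLASSES):
    """Return {class: (precision, recall)} for a multiclass prediction."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed a true member of this class
    result = {}
    for cls in classes:
        predicted = tp[cls] + fp[cls]  # all predicted positives
        actual = tp[cls] + fn[cls]     # all true members
        precision = tp[cls] / predicted if predicted else float("nan")
        recall = tp[cls] / actual if actual else float("nan")
        result[cls] = (precision, recall)
    return result


# Toy labels for illustration only.
truth = ["Benign", "Benign", "Recessive", "Recessive", "Dominant", "Dominant"]
pred = ["Benign", "Benign", "Recessive", "Dominant", "Dominant", "Benign"]
metrics = precision_recall(truth, pred)
```

In the toy data, the Dominant variant called Benign lowers Benign precision and Dominant recall at the same time, which is why the two metrics are reported separately for each class.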
Figure 4. Feature importance plot. First, out-of-bag (OOB) prediction error is calculated for each tree. Next, OOB error is calculated after permuting each feature. Finally, to derive the mean decrease in accuracy, the difference between the two is averaged across all trees and normalized by the standard deviation of the differences.
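The normalization described in the Figure 4 caption can be sketched for a single feature as follows. The per-tree error values are invented for illustration, and returning the raw mean when the standard deviation is zero is a choice made for this sketch.

```python
from statistics import mean, stdev


def mean_decrease_accuracy(oob_error, oob_error_permuted):
    """Importance of one feature from per-tree OOB errors.

    oob_error[i] is tree i's OOB error on the intact data;
    oob_error_permuted[i] is the same tree's OOB error after the
    feature's values have been permuted.
    """
    diffs = [perm - base for base, perm in zip(oob_error, oob_error_permuted)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        return mean(diffs)  # no spread: skip the normalization (sketch choice)
    return mean(diffs) / spread


# Invented per-tree errors for a forest of four trees.
base_errors = [0.10, 0.12, 0.08, 0.10]
permuted_errors = [0.20, 0.18, 0.16, 0.22]
importance = mean_decrease_accuracy(base_errors, permuted_errors)
```

A feature whose permutation barely changes the OOB error gets a score near zero, while a feature the trees rely on gets a large positive score.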
Table 2. Precision values using different subsets of features. The precision values were calculated based on the training data under the haploinsufficient model using a 10-fold cross-validation.
• 68 variants from the Centers for Mendelian Genomics (CMG) [21], which are working towards identifying the genetic basis of Mendelian diseases
• 158 variants from the Deciphering Developmental Disorders Study (DDDS) [22], which includes 1,133 children presenting with severe, undiagnosed developmental disorders, 28% of whom were identified with possibly pathogenic variants
Table 3. MAPPIN prediction accuracy for two Mendelian datasets: prediction accuracies for pathogenicity and mode of inheritance on the CMG and DDDS validation datasets. The rows "CMG (genes not in training)" and "DDDS (genes not in training)" give the results after excluding all CMG and DDDS genes from the training data.
Dataset | Pathogenicity Prediction Accuracy | Inheritance Prediction Accuracy
CMG | 68/68 (100%) | 45/64 (70.3%)
DDDS | 138/158 (87.3%) | 124/158 (78.5%)
CMG (genes not in training) | 68/68 (100%) | 45/64 (70.3%)
DDDS (genes not in training) | 138/158 (87.3%) | 125/158 (79.1%)
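The percentages in Table 3 follow directly from the counts. The short check below recomputes them; the counts are copied from the table, and the helper name `percent` is just for this sketch.

```python
def percent(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


# (correct, total) pairs copied from Table 3:
# first pair = pathogenicity, second pair = mode of inheritance.
table3_counts = {
    "CMG": ((68, 68), (45, 64)),
    "DDDS": ((138, 158), (124, 158)),
    "CMG (genes not in training)": ((68, 68), (45, 64)),
    "DDDS (genes not in training)": ((138, 158), (125, 158)),
}

recomputed = {
    name: (percent(*pathogenicity), percent(*inheritance))
    for name, (pathogenicity, inheritance) in table3_counts.items()
}
```

The CMG inheritance accuracy uses a denominator of 64 rather than 68, as given in the table.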
MAPPIN FEATURES
Figure 5. Violin plots of score distributions for CMG and DDDS dominant and recessive disease-causing variants. MAPPIN dominant (A) and recessive (B) class scores for CMG and DDDS genes annotated as dominant or recessive. CADD (C) and Eigen (D) phred scores [23] for CMG and DDDS genes annotated as dominant or recessive.
CONCLUSIONS & APPLICATIONS
• To our knowledge, this is the first nsSNV prediction algorithm that predicts pathogenicity and mode of inheritance by classifying variants into three groups
• Mode of inheritance predictions are useful because they allow the genotype to be taken into account when prioritizing variants
  - Prevents heterozygous carrier variants from being classified as equally pathogenic as dominant-acting heterozygous mutations
• In Mendelian family-based analysis, MAPPIN would be useful for variant prioritization and interpretation, especially in cases where there is not enough information to identify the inheritance pattern
• For population genetics, MAPPIN annotations and predictions can support interpretation of variant and phenotype associations and variant aggregation for gene burden-based association testing
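One way to picture how a mode of inheritance prediction combines with genotype during prioritization is the sketch below. Everything in it is hypothetical: the function names, the genotype codes ("het", "hom_alt"), the triage labels, and the rule of taking the highest of the three class scores are assumptions for illustration, not MAPPIN's interface.

```python
def predicted_class(scores):
    """Pick the class with the highest score (assumed decision rule)."""
    return max(scores, key=scores.get)


def triage(scores, genotype):
    """Return a hypothetical triage label for one nsSNV call.

    scores: dict with "Dominant", "Recessive", and "Benign" class scores.
    genotype: "het" or "hom_alt" (hypothetical genotype codes).
    """
    label = predicted_class(scores)
    if label == "Benign":
        return "deprioritize"
    if label == "Dominant":
        return "candidate"  # one altered copy may be enough
    # Recessive class: a single heterozygous call is treated as a carrier.
    # This sketch ignores a possible second variant in the same gene.
    return "candidate" if genotype == "hom_alt" else "carrier"


# Invented scores for illustration only.
dominant_like = {"Dominant": 0.7, "Recessive": 0.2, "Benign": 0.1}
recessive_like = {"Dominant": 0.1, "Recessive": 0.8, "Benign": 0.1}
```

For example, `triage(recessive_like, "het")` returns "carrier" while `triage(dominant_like, "het")` returns "candidate", which is the distinction between carrier and dominant-acting heterozygous variants drawn in the conclusions above.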