Whole-genome sequencing and the detection of disease-causing mutations
Lynn Jorde
Department of Human Genetics
University of Utah School of Medicine
23 June 2012
Roach et al., 2010, Science
Overview
• First whole-genome sequencing of a human family
• Disease gene detection
• Mutation rate analysis
• Methods to uncover mutations in whole-genome data
– Ogden syndrome
– Crohn disease
Nature, April 1, 2010
Miller syndrome family (postaxial acrofacial dysostosis)
Craniofacial malformations
Limb malformations
Conductive hearing loss
Fineman, 1981, Journal of Pediatrics 98: 87-8
A second Miller syndrome family
Miller syndrome family history
CFTR genotyping
[Pedigree figure: family members labeled CFTR ΔF508]
Exome and whole-genome sequencing identifies two autosomal recessive conditions
• Miller syndrome: DHODH, which encodes dihydroorotate dehydrogenase
– involved in de novo synthesis of pyrimidines
• Primary ciliary dyskinesia: DNAH5, which encodes a dynein heavy chain needed for normal ciliary function
– pulmonary infections, bronchiectasis
– situs inversus
Ng et al., 2010, Nature Genetics; Roach et al., 2010, Science
DNA sequencing reveals that both affected children inherited five disease-causing variants
[Pedigree figure: compound heterozygotes; pedigree labels DHODH*, DNAH5*]
Key: DHODH* (Miller syndrome); DNAH5* (primary ciliary dyskinesia); ΔF508 (cystic fibrosis)
Each affected child: DHODH*/DHODH*, DNAH5*/DNAH5*, ΔF508
Ng et al., 2010, Nature Genetics; Roach et al., 2010, Science
11 March 2010 VOL 328 SCIENCE
Salt Lake Tribune, 11 March 2010
Nature, 6 Oct. 2011
The challenge of whole-genome sequences
• ~3 million single-nucleotide variants, ~10,000 amino acid-changing variants, and ~100 loss-of-function variants in an average individual
• Goal is to identify a tiny subset of these (e.g., de novo mutations or inherited disease-causing mutations)
• This presents a major analytic bottleneck
• Many tools are being used: SIFT, PolyPhen, ANNOVAR, SVA, etc.
An “all-in-one” tool is needed
VAAST: Variant Annotation, Analysis and Selection Tool
• Compares patient sequences with variant frequencies in sequence databases
– SNPs and indels
• Assesses functional impact of amino acid substitutions (BLOSUM scores, OMIM)
– Evidence of purifying selection used
• Analysis done on a gene-by-gene basis (21,000 comparisons instead of 1,000,000 in a GWAS)
• Can evaluate coding or noncoding regions
• Incorporates pedigree data and LOD scores
Yandell, Huff, Hu, Singleton, Moore, Xing, Jorde, Reese, Genome Res. (July, 2011)
Composite likelihood ratio test
• Null hypothesis: variant does not cause disease
• Alternative hypothesis: variant causes disease
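The two hypotheses above can be illustrated with a simplified likelihood ratio. This is a minimal sketch, not VAAST's implementation: it assumes a plain binomial model of allele counts, ignores the functional-impact weighting described earlier, and uses hypothetical counts and a hypothetical tuple layout for each site.

```python
import math

def binom_loglik(k, n, p):
    """Binomial log-likelihood of k variant alleles among n chromosomes
    at allele frequency p (the constant binomial coefficient is omitted)."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll

def composite_lrt(sites):
    """Sum -2 ln(L_null / L_alt) over the variant sites of one gene.

    Each site is (k_case, n_case, k_ctrl, n_ctrl): variant-allele count and
    chromosome count in cases and in controls (hypothetical layout).
    Null: cases and controls share one allele frequency.
    Alternative: cases and controls have separate frequencies.
    """
    stat = 0.0
    for k_case, n_case, k_ctrl, n_ctrl in sites:
        p_pooled = (k_case + k_ctrl) / (n_case + n_ctrl)
        ll_null = (binom_loglik(k_case, n_case, p_pooled)
                   + binom_loglik(k_ctrl, n_ctrl, p_pooled))
        ll_alt = (binom_loglik(k_case, n_case, k_case / n_case)
                  + binom_loglik(k_ctrl, n_ctrl, k_ctrl / n_ctrl))
        stat += -2.0 * (ll_null - ll_alt)
    return stat

# Hypothetical counts for one gene with two variant sites.
gene_sites = [(4, 6, 1, 380), (3, 6, 2, 380)]
print(composite_lrt(gene_sites))
```

Summing over sites treats them as independent, which is what makes the likelihood "composite"; a larger statistic means the case and control frequencies differ more at that gene.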
http://www.yandell-lab.org/software/vaast.html (freely available to academic researchers)
Family data improve disease-gene identification
• Exome analysis (Ng et al. 2010) used four affected individuals from three families to identify DHODH
• Whole-genome analysis of first family, incorporating parental sequences, identified four candidate loci (Roach et al. 2010)
• VAAST identifies just two candidate loci: DNAH5 and DHODH
VAAST analysis of Miller kindred 1 after incorporating parents’ sequences: only two candidate genes
[Figure: per-gene VAAST p-values (P gene); DHODH ranked 1st and DNAH5 ranked 2nd out of 21,902 genes. Legend: P < 1e-3; P > 2.4e-6; P < 2.4e-6]
VAAST analysis of a lethal, X-linked condition
X chromosome exome sequencing of four individuals
Rope et al., Am. J. Hum. Genet. (July 2011)
“Progeria-like” features; cardiac arrhythmia; death before age one year
Rope et al., 2011, Am J Hum Genet
[Figure: VAAST p-values by location along the X chromosome; NAA10 labeled; X-chromosome-wide significance (p = 0.05/729 = 7 x 10^-5)]
VAAST identifies a single candidate locus (analysis time: 20 minutes)
Mutation in NAA10 causes a new genetic disease: “Ogden syndrome”
NAA10: N(alpha)-acetyltransferase 10, part of a complex responsible for alpha-acetylation of peptides.
13 family members genotyped for NAA10 mutation. In vitro analysis: loss-of-function effect. Genetic test developed.
What about more common diseases?
Crohn disease in a Utah kindred
Stephen Guthery, MD, Dept. of Pediatrics
P < 0.001 for observed vs. expected affected individuals
[Pedigree figure: founders b. ~1800s; >3 meioses to founder on each branch. Legend: Crohn; Psoriasis; Ulcerative colitis; Chronic diarrhea; Crohn + vitiligo; Crohn + vitiligo + psoriasis; ?]
Exome sequencing of 3 individuals
Using VAAST: Thousands of candidates narrowed to 93
                                                                   Individual 1   Individual 2   Individual 3
Total number of SNVs                                               50,664         49,724         43,081
Number of variants predicted to cause amino acid changes           10,506         10,175         9,026
Number of shared variants predicted to cause amino acid changes    5,302 (shared by all three)
Number of VAAST candidates (p = 6.5 x 10^-8)                       93
Shared segment on Chromosome 5
Two shared segments inherited from common ancestors: 12 cM on chromosome 5 and 14 cM on chromosome 18
Using shared genomic segments: from 93 candidates to a single candidate
• A single missense variant in LRRC70, a gene with substantial sequence identity to toll‐like receptor genes involved in innate immunity
• Amino acid is highly conserved, and LRRC70 appears in GWAS
Summary
• Exome and whole‐genome sequencing methods effectively identify disease‐causing mutations
• New analytic methods, such as VAAST, greatly improve our capacity to identify these mutations
– Incorporation of family data: “real genetics”
Acknowledgments
University of Utah: Chad Huff, Jinchuan Xing, Dave Witherspoon, Steve Guthery, Scott Watkins, Mark Yandell, Barry Moore, Hao Hu, Alan Rope, John Carey, John Opitz
Institute for Systems Biology: Jared Roach, David Galas, Lee Hood
Children’s Hospital of Philadelphia: Gholson Lyon, Hakon Hakonarson
University of Washington: Mike Bamshad, Jay Shendure
NIH: Les Biesecker
Pedigree‐VAAST (pVAAST)
Incorporates inheritance pattern information directly into the likelihood ratio
LOD_i: LOD score for this gene for the i-th family.
• LOD scores calculated for each potential “causal” allele in a family.
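For orientation, the classical two-point LOD score and its summation across families can be sketched as follows. This is only an illustration of the standard linkage formula under an assumed phase-known model; how pVAAST folds LOD_i into its likelihood ratio is not shown on the slide, and the family sizes below are hypothetical.

```python
import math

def two_point_lod(recombinants, meioses, theta):
    """Classical two-point LOD: log10 of L(theta) / L(theta = 0.5)
    for phase-known meioses. Requires theta > 0 when recombinants > 0."""
    nonrecombinants = meioses - recombinants
    lik_theta = (theta ** recombinants) * ((1.0 - theta) ** nonrecombinants)
    lik_unlinked = 0.5 ** meioses
    return math.log10(lik_theta / lik_unlinked)

# Hypothetical example: two families with 6 and 4 informative meioses,
# no recombinants, evaluated at theta = 0.
family_lods = [two_point_lod(0, 6, 0.0), two_point_lod(0, 4, 0.0)]
total_lod = sum(family_lods)  # LOD scores add across independent families
print(round(total_lod, 2))
```

Each fully informative, non-recombinant meiosis contributes log10(2), about 0.3, so ten such meioses reach the conventional LOD threshold of 3.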
VAAST 1.0 Cardiac Septal Defect Results (21,000 genes)
[Figure: VAAST 1.0 log p-value by location along the genome; GATA4 labeled]
Run time: 24 hours on 42 CPUs
pVAAST Cardiac Septal Defect Results (21,000 genes)
[Figure: pVAAST log p-value by location along the genome; GATA4 labeled]
Run time: 4 hours on 42 CPUs
Genome-wide significance (p = 0.05/21000 = 2.4 x 10^-6)
Male-specific Mutation Rate Estimation
• Identify IBD regions.
• Relatives provide paternal and offspring genome “replicates” to reduce false positives.
• 12 mutations in 600 Mb of sequence data.
• Estimated male-specific mutation rate of 2.0 x 10^-8 per base pair per generation (compared to sex-averaged rate of 1.1 x 10^-8)
• Male mutation rate is ~5x the female rate.
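The per-base-pair figures quoted in this talk follow from dividing a mutation count by the amount of sequence surveyed. A minimal sketch, assuming the stated sequence totals already account for the number of transmissions observed (the function name is illustrative):

```python
def per_bp_rate(n_mutations, n_bases):
    """Mutations per base pair, given a count and the bases surveyed."""
    return n_mutations / n_bases

# 12 mutations in 600 Mb (the male-specific estimate quoted above)
male_rate = per_bp_rate(12, 600e6)

# 11 mutations in 700 Mb (an alternative count shown elsewhere in the deck)
alt_rate = per_bp_rate(11, 700e6)

print(f"{male_rate:.1e}", f"{alt_rate:.1e}")
```

With so few observed mutations, each estimate carries wide sampling uncertainty, which is why the talk considers pooling additional trios.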
VAAST greatly improves gene prioritization compared to standard hard filtering (e.g. ANNOVAR)
Yandell et al. Genome Research, 2011
[Figure: true disease gene rank. Average genome-wide rank of the true disease-causing gene for 100 different Mendelian disease genes. Approx. 21,000 genes tested per disease gene. 190 control genomes. Values shown on plot: 9, 3]
An early estimate of the human mutation rate
• Estimated mutation rate, μ, for hemophilia = 2 x 10^-5 per locus per generation (Haldane, 1935, “The rate of spontaneous mutation of a human gene” J. Genet. 31: 21-30)
• If 1 kb of the factor VIII gene is functionally significant, μ = 2 x 10^-8 per base pair
• Our estimate of μ was 1.1 x 10^-8 (Roach et al., 2010)
• Haldane (1947) estimated that the male mutation rate is up to 10 times higher than that of the female
New Questions
• Direct estimate of paternal age effect on mutation
• Variation in mutation rates
– Variation in DNA repair genes
– Coding vs. various types of non‐coding DNA
– Effects of nearby sequence motifs, indels, etc.
– Population variation
Identical NAA10 mutation found in a second family
Rope et al., 2011, Am. J. Hum. Genet.
ERSA (Huff et al., 2011, Genome Res.) shows the two families are unrelated; the mutation is at least 100 generations old
Identifying a gene responsible for a lethal X-linked disorder
Rope AF, et al. Am J Hum Genet, 2011.
[Figure: VAAST p-values by location along the X chromosome; NAA10 labeled; X-chromosome-wide significance (p = 0.05/729 = 7 x 10^-5)]
Previously undescribed X-linked lethal disorder characterized by aged appearance, severe developmental delay, and cardiac arrhythmias.
LOD score needed for genome-wide significance for GATA4
Rare SNPs are much more likely to be population-specific
Common SNPs previously identified in dbSNP
New rarer SNPs identified by sequencing
1000 Genomes Project, 2010
Mutation rate estimation
• 33,937 Mendelian inheritance errors in 1.8 Gb of analyzed sequence in two offspring
– Only a few dozen mutations are expected
– 99.9% of mutation candidates are sequencing errors
• All were resequenced
– Agilent custom capture array
– Illumina GA II sequencing
– High-stringency base calling to minimize false positives
• Maq
• Binomial method
Mutation rate estimation
• All but 28 of the Mendelian inheritance errors were excluded
– all 28 confirmed by mass spectrometry
• Estimate the false negative rate by comparing whole-genome sequence with targeted resequencing results
– 5 Mb from 25 randomly selected genomic regions (200 kb each)
• Mutation rate = 1.1 x 10^-8 (6.8 x 10^-9 to 1.7 x 10^-8) /nucleotide/generation: each gamete has ~35 new variants
Other mutation rate estimates
Rate (x 10^-8)   Source
2.5              18 processed pseudogenes (16 kb) in human v. chimp; Nachman and Crowell, 2000, Genetics
1.8              20 coding loci (74 kb); Kondrashov, 2003, Hum. Mutat.
1.5              DMD locus (85 kb); Flanigan et al., 2009, Hum. Mutat.
1.3              39 coding loci; Lynch, 2010, PNAS
1.4              294 Mb noncoding seq.; Awadalla et al., 2010, AJHG
1.0; 1.2         Two resequenced trios; 1000 Genomes Project, 2010, Nature
ANNOVAR: Wang et al. 2010 Nucl. Acids Res.
Filtering variants
[Pedigree figure, generations 1 to 4. Key: - candidate mutation absent; + candidate mutation present (in one or more offspring)]
• First exclude all base pairs:
– No-called in any member of the pedigree
– Polymorphic in any other Complete Genomics genome and any mutation present in the 2nd and 3rd generation
• Next identify all genomic regions IBD between 2nd, 3rd, and 4th generation using Shared Genomic Segments (700 Mb)
Male-specific mutation rate estimation
Direct estimation of the mutation rate in a family quartet
[Pedigree figure. Key: - candidate mutation absent; + candidate mutation present (in one or more offspring)]
• 11 mutations in 700 Mb of sequence
• Estimated mutation rate: 1.6 x 10^-8 per nucleotide
• Estimated ratio of male-to-female mutation rates: 2.5
Accuracy of mutation rate estimate
[Figure panels: probability vs. estimated mutation rate (x 10^-8), axis from 0 to 2. Panels: Miller syndrome family; ... and two 1000 Genomes trios; Projected: 20 trios]
Whole-genome genetic variation
• One or more family members differed from NCBI reference sequence (Build 36.1) at 4.5 million positions
• Nucleotide diversity (θ), comparing two parents to reference sequence = 9.5 x 10^-4
– Watson and Venter vs. NCBI reference: θ = 9.3 x 10^-4
• 323,255 novel SNPs
– Family analysis showed that about 20% of these apparent SNPs are sequencing errors
Coverage of Family of Four by Region Type
[Stacked bar chart, 0% to 100%, with bars for Mom, Dad, Dau, and Son in each region type: Exome; Unique; CNVs; Repeats (Int/Old); Repeats (Int/Young); Repeats (Simple/Old); Repeats (Simple/Young). Bar segments: uncalled; one allele called; both alleles called]
1.8 Gb of data useful for mutation detection
Pattern of a de novo mutation
[Pedigree diagram: both parents R/R, one child R/N]
R: human reference allele; N: non-reference allele
“Scrambled trios”
[Diagrams: scrambled trios labeled with R/R and R/N genotypes]
2768 sites matched one of these patterns
At 1832 sites, at least one genotype could not be confirmed
False negative rate = 1832/2768 = 66.2%
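One way the numbers in this talk fit together is sketched below. The denominator is an assumption rather than something stated on the slides: it treats the 1.8 Gb of analyzed sequence as surveyed in two offspring, each carrying two haploid genomes, and corrects for the false negative rate. The haploid genome size of about 3.2 Gb is likewise an assumed round figure.

```python
# False negative rate from the scrambled-trio test
candidate_sites = 2768
unconfirmed_sites = 1832
false_negative_rate = unconfirmed_sites / candidate_sites  # about 0.662

# Confirmed de novo mutations and analyzed sequence (from the talk)
confirmed_mutations = 28
analyzed_bases = 1.8e9

# Assumption: two offspring, two haploid genomes (gametes) each
haploid_genomes = 2 * 2

rate = confirmed_mutations / (
    haploid_genomes * analyzed_bases * (1.0 - false_negative_rate)
)

# Assumption: haploid genome of roughly 3.2 Gb
haploid_genome_size = 3.2e9
new_variants_per_gamete = 1.1e-8 * haploid_genome_size

print(f"false negative rate: {false_negative_rate:.3f}")
print(f"rate per nucleotide per generation: {rate:.2e}")
print(f"new variants per gamete: {new_variants_per_gamete:.0f}")
```

Under these assumptions the rate comes out close to the reported 1.1 x 10^-8, and multiplying the reported rate by the genome size gives roughly the 35 new variants per gamete quoted in the talk.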
Variant-based vs. feature-based prioritization
What is the likelihood of observing the composite genotype at Gene X in cases given the controls?
[Figure: Gene X with rows labeled Controls and Cases, showing allele frequencies at its variant sites]
C/T: 0.8/0.2; G/C: 0.6/0.4
C/T: 0.1/0.9; G/C: 0.4/0.6; C/T: 0.2/0.8, S -> *
Bonferroni correction for number of features (~21,000 human genes).
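The significance thresholds quoted in this talk are plain Bonferroni corrections: the family-wise alpha divided by the number of features tested. A short sketch that reproduces them (the function name is illustrative):

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold for a family-wise error rate alpha."""
    return alpha / n_tests

genome_wide = bonferroni_threshold(0.05, 21_000)   # gene-by-gene test, ~21,000 genes
x_chromosome = bonferroni_threshold(0.05, 729)     # 729 X-chromosome features
gwas = bonferroni_threshold(0.05, 1_000_000)       # ~1,000,000 comparisons in a GWAS

print(f"{genome_wide:.1e}", f"{x_chromosome:.1e}", f"{gwas:.1e}")
```

Testing per gene rather than per variant is what moves the threshold from about 5 x 10^-8 to about 2.4 x 10^-6.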
Mutation rate for the factor VIII gene (hemophilia A)
• Frequency, f, of males in London with hemophilia: 4 x 10^-5 to 17 x 10^-5 (= q)
• Selection coefficient, s, = 0.75 to 0.90
• Then, µ = sq/3 = 10^-5 to 5 x 10^-5
• Estimated mutation rate, µ, = 2 x 10^-5 per locus per generation
Haldane, JBS, 1935, “The rate of spontaneous mutation of a human gene” J. Genet. 31: 21-30
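Haldane's range can be checked directly from µ = sq/3 with the endpoints given above, and the per-locus estimate can be converted to a per-base-pair rate under the talk's assumption that about 1 kb of the gene is functionally significant. Variable names are illustrative.

```python
def haldane_mu(s, q):
    """Mutation-selection balance for an X-linked recessive: mu = s * q / 3."""
    return s * q / 3.0

mu_low = haldane_mu(0.75, 4e-5)    # lower end of the range
mu_high = haldane_mu(0.90, 17e-5)  # upper end of the range

# Per-base-pair rate if ~1 kb of the locus is functionally significant
mu_locus = 2e-5
functional_bases = 1_000
mu_per_bp = mu_locus / functional_bases

print(f"{mu_low:.1e} to {mu_high:.1e} per locus; {mu_per_bp:.0e} per bp")
```

The factor of 3 reflects that one third of X chromosomes are carried by males, where the recessive allele is exposed to selection.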
High-resolution human recombination map
• Median resolution of crossover sites was 2.6 kb
• 155 crossovers in four meioses
– 98 maternal
– 57 paternal
• 92 crossovers (~60%) located within a HapMap-defined recombination hotspot
Larger chimp-human ancestral population size accommodates earlier coalescence: more generations in which mutations can occur
[Figure: two chimp-human genealogies, one with Ne = 10,000 and one with Ne = 100,000]
Characterizing the mutations
• None were known SNPs (dbSNP, 1000 Genomes)
• Each was seen in only one of two sibs
• Mutations at CpG sites
– Expected = 5.04 of 28 (based on 10x elevation in rate)
– Observed: 5 of 28
• Non-CpG transition/transversion rate: 2.3
Reconciling the direct mutation estimate with the phylogenetic estimate
• Mutation rate of 2.5 x 10^-8 assumes:
– Human-chimp divergence of 5 MYA
– Small effective population size of human-chimp common ancestor (10,000)
• Assuming a divergence time of 6 MYA and effective size of 100,000, rate = 1.3 x 10^-8
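The sensitivity of the phylogenetic estimate to these assumptions can be illustrated with the standard neutral-divergence relation k = 2µt + 4Neµ, solved for µ. The sequence divergence of 1.33% and the 20-year generation time below are assumptions not given on the slides; they were chosen because they are the values commonly attributed to the pseudogene-based estimate.

```python
def phylogenetic_rate(divergence, split_years, ancestral_ne, generation_years=20.0):
    """Per-generation mutation rate from neutral divergence:
    mu = k / (2t + 4Ne), with t measured in generations."""
    generations_since_split = split_years / generation_years
    return divergence / (2.0 * generations_since_split + 4.0 * ancestral_ne)

K = 0.0133  # assumed human-chimp divergence at neutral sites

rate_small_ne = phylogenetic_rate(K, 5e6, 10_000)    # 5 MYA, Ne = 10,000
rate_large_ne = phylogenetic_rate(K, 6e6, 100_000)   # 6 MYA, Ne = 100,000

print(f"{rate_small_ne:.2e}", f"{rate_large_ne:.2e}")
```

A larger ancestral population pushes coalescence further back, so the same divergence is spread over more generations and the implied rate falls toward the direct estimate.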