whole-genome sequencing and the detection of disease

Whole-genome sequencing and the detection of disease-causing mutations

Lynn Jorde
Department of Human Genetics
University of Utah School of Medicine
23 June 2012

Roach et al., 2010, Science

Overview

• First whole-genome sequencing of a human family
• Disease gene detection
• Mutation rate analysis
• Methods to uncover mutations in whole-genome data
– Ogden syndrome
– Crohn disease

Nature, April 1, 2010

Miller syndrome family (postaxial acrofacial dysostosis)

Craniofacial malformations

Limb malformations

Conductive hearing loss

Fineman, 1981, Journal of Pediatrics 98: 87-8

A second Miller syndrome family

Miller syndrome family history

CFTR genotyping

[Pedigree figure: CFTR ΔF508 genotype labels]

Exome and whole-genome sequencing identifies two autosomal recessive conditions

• Miller syndrome: DHODH, which encodes dihydroorotate dehydrogenase
– involved in de novo synthesis of pyrimidines
• Primary ciliary dyskinesia: DNAH5, which encodes a dynein heavy chain needed for normal ciliary function
– pulmonary infections, bronchiectasis
– situs inversus

Ng et al., 2010, Nature Genetics; Roach et al., 2010, Science

DNA sequencing reveals that both affected children inherited five disease-causing variants

Compound heterozygotes

[Pedigree figure. Variant labels: DHODH* (Miller syndrome), DNAH5* (primary ciliary dyskinesia), ΔF508 (cystic fibrosis). Each affected child: DHODH*/DHODH*, DNAH5*/DNAH5*, ΔF508]

Ng et al., 2010, Nature Genetics; Roach et al., 2010, Science

11 March 2010 VOL 328 SCIENCE

Salt Lake Tribune, 11 March 2010

Nature, 6 Oct. 2011

• ~3 million single-nucleotide variants, ~10,000 amino acid-changing variants, and ~100 loss-of-function variants in an average individual

• Goal is to identify a tiny subset of these (e.g., de novo mutations or inherited disease-causing mutations)

• This presents a major analytic bottleneck

• Many tools are being used: SIFT, PolyPhen, ANNOVAR, SVA, etc.

The challenge of whole-genome sequences

An “all-in-one” tool is needed

VAAST: Variant Annotation, Analysis and Selection Tool

• Compares patient sequences with variant frequencies in sequence databases
– SNPs and indels
• Assesses functional impact of amino acid substitutions (BLOSUM scores, OMIM)
– Evidence of purifying selection used
• Analysis done on a gene-by-gene basis (21,000 comparisons instead of 1,000,000 in a GWAS)
• Can evaluate coding or noncoding regions
• Incorporates pedigree data and LOD scores
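The practical effect of testing genes rather than individual markers can be seen in the multiple-testing threshold. A minimal sketch, assuming a plain Bonferroni correction at alpha = 0.05; the function name is illustrative and not part of VAAST:

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests

# Test counts taken from the slide; alpha = 0.05 is the conventional choice.
GENE_TESTS = 21_000       # gene-by-gene comparisons
GWAS_TESTS = 1_000_000    # marker-by-marker comparisons in a GWAS

gene_threshold = bonferroni_threshold(0.05, GENE_TESTS)
gwas_threshold = bonferroni_threshold(0.05, GWAS_TESTS)

print(f"gene-based threshold: {gene_threshold:.1e}")
print(f"GWAS threshold:       {gwas_threshold:.1e}")
```

With roughly fifty-fold fewer tests, the gene-based threshold is roughly fifty-fold less stringent, which matters when only a handful of affected individuals are sequenced.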

Yandell, Huff, Hu, Singleton, Moore, Xing, Jorde, Reese, Genome Res. (July, 2011)

Composite likelihood ratio test

• Null hypothesis: variant does not cause disease

• Alternative hypothesis: variant causes disease
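To make the two hypotheses concrete, here is a deliberately simplified sketch of a likelihood ratio test for a single variant, comparing allele counts in cases and controls. It is not VAAST's composite likelihood ratio, which works per gene and also weights amino acid substitutions; the allele counts below are hypothetical.

```python
import math

def binom_loglik(k: int, n: int, p: float) -> float:
    """Log-likelihood of k variant alleles among n chromosomes (constant term omitted)."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll

def likelihood_ratio_test(case_alt: int, case_n: int, ctrl_alt: int, ctrl_n: int):
    """Null: one shared allele frequency. Alternative: separate case and control frequencies."""
    pooled = (case_alt + ctrl_alt) / (case_n + ctrl_n)
    ll_null = (binom_loglik(case_alt, case_n, pooled)
               + binom_loglik(ctrl_alt, ctrl_n, pooled))
    ll_alt = (binom_loglik(case_alt, case_n, case_alt / case_n)
              + binom_loglik(ctrl_alt, ctrl_n, ctrl_alt / ctrl_n))
    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: variant on 8 of 8 case chromosomes, 2 of 380 control chromosomes.
stat, p_value = likelihood_ratio_test(8, 8, 2, 380)
print(f"LR statistic = {stat:.1f}, p = {p_value:.2e}")
```

The asymptotic chi-square p-value is used here only for brevity; with so few cases a permutation-based p-value would be the safer choice.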

http://www.yandell-lab.org/software/vaast.html (freely available to academic researchers)

Family data improve disease-gene identification

• Exome analysis (Ng et al. 2010) used four affected individuals from three families to identify DHODH

• Whole-genome analysis of first family, incorporating parental sequences, identified four candidate loci (Roach et al. 2010)

• VAAST identifies just two candidate loci: DNAH5 and DHODH

VAAST analysis of Miller kindred 1 after incorporating parents’ sequences: only two candidate genes

DHODH ranked 1st and DNAH5 ranked 2nd out of 21,902 genes

[Figure: per-gene P values (Pgene); legend bins: P < 1e-3, P > 2.4e-6, P < 2.4e-6]

VAAST analysis of a lethal, X-linked condition

X chromosome exome sequencing of four individuals

Rope et al., Am. J. Hum. Genet. (July 2011)

“Progeria-like” features; cardiac arrhythmia; death before age one year

Rope et al., 2011, Am J Hum Genet

NAA10

X-chromosome-wide significance (p = 0.05/729 = 7 x 10^-5)

Location along the X-chromosome

VAAST identifies a single candidate locus (analysis time: 20 minutes)

Mutation in NAA10 causes a new genetic disease: “Ogden syndrome”

NAA10: N(alpha)-acetyltransferase 10, part of a complex responsible for alpha-acetylation of peptides.

13 family members genotyped for NAA10 mutation. In vitro analysis: loss-of-function effect. Genetic test developed.

What about more common diseases?
Crohn disease in a Utah kindred

Stephen Guthery, MD, Dept. of Pediatrics

P < 0.001 for observed vs. expected affected individuals

>3 meioses to founder

[Pedigree figure: founders b. ~1800s; phenotype legend: Crohn, psoriasis, ulcerative colitis, chronic diarrhea, Crohn + vitiligo, Crohn + vitiligo + psoriasis]

Exome sequencing of 3 individuals

Using VAAST: Thousands of candidates narrowed to 93

                                                                  Individual 1   Individual 2   Individual 3
Total number of SNVs                                              50,664         49,724         43,081
Number of variants predicted to cause amino acid changes          10,506         10,175         9,026
Number of shared variants predicted to cause amino acid changes   5,302
Number of VAAST candidates (p = 6.5 x 10^-8)                      93

Shared segment on Chromosome 5

Two shared segments inherited from common ancestors: 12 cM on chromosome 5 and 14 cM on chromosome 18

Using shared genomic segments: from 93 candidates to a single candidate

• A single missense variant in LRRC70, a gene with substantial sequence identity to toll‐like receptor genes involved in innate immunity

• Amino acid is highly conserved, and LRRC70 appears in GWAS

Summary

• Exome and whole‐genome sequencing methods effectively identify disease‐causing mutations

• New analytic methods, such as VAAST, greatly improve our capacity to identify these mutations
– Incorporation of family data: “real genetics”

Acknowledgments

University of Utah: Chad Huff, Jinchuan Xing, Dave Witherspoon, Steve Guthery, Scott Watkins, Mark Yandell, Barry Moore, Hao Hu, Alan Rope, John Carey, John Opitz
Institute for Systems Biology: Jared Roach, David Galas, Lee Hood
Children’s Hospital of Philadelphia: Gholson Lyon, Hakon Hakonarson
University of Washington: Mike Bamshad, Jay Shendure
NIH: Les Biesecker

Pedigree‐VAAST (pVAAST)


Incorporates inheritance pattern information directly into the likelihood ratio

LOD_i: LOD score for this gene for the i-th family.

• LOD scores calculated for each potential “causal” allele in a family.
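For readers unfamiliar with LOD scores, the sketch below computes a classical two-point LOD score and sums it across families. This is the textbook linkage statistic, shown only as background; the slides do not give pVAAST's exact formula, and the family counts used here are hypothetical.

```python
import math

def two_point_lod(recombinants: int, meioses: int, theta: float) -> float:
    """Classical LOD: log10 of L(theta) / L(theta = 0.5) for phase-known meioses."""
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must be between 0 and 0.5")
    if theta == 0.0 and recombinants > 0:
        return float("-inf")  # a recombinant is impossible under complete linkage
    log_l_linked = 0.0
    if recombinants > 0:
        log_l_linked += recombinants * math.log10(theta)
    log_l_linked += (meioses - recombinants) * math.log10(1.0 - theta)
    log_l_unlinked = meioses * math.log10(0.5)
    return log_l_linked - log_l_unlinked

def combined_lod(families, theta: float) -> float:
    """Sum of per-family LOD scores; families is a list of (recombinants, meioses)."""
    return sum(two_point_lod(r, n, theta) for r, n in families)

# Hypothetical data: two families with 10 and 5 informative meioses, no recombinants.
total = combined_lod([(0, 10), (0, 5)], theta=0.0)
print(f"combined LOD = {total:.2f}")
```

Summing per-family scores is the standard way to pool linkage evidence across independent pedigrees, which is why small families can still contribute.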

VAAST 1.0 Cardiac Septal Defect Results (21,000 genes)

Run time: 24 hours on 42 CPUs

[Figure: VAAST 1.0 log p-value vs. location along the genome; GATA4 labeled]

pVAAST Cardiac Septal Defect Results (21,000 genes)

Run time: 4 hours on 42 CPUs

[Figure: pVAAST log p-value vs. location along the genome; GATA4 labeled]

Genome-wide significance (p = 0.05/21,000 = 2.4 x 10^-6)

Male‐specific Mutation Rate Estimation

• Identify IBD regions.
• Relatives provide paternal and offspring genome “replicates” to reduce false positives.
• 12 mutations in 600 Mb of sequence data.
• Estimated male-specific mutation rate of 2.0 x 10^-8 per base pair per generation (compared to sex-averaged rate of 1.1 x 10^-8)
• Male mutation rate is ~5x the female rate.


VAAST greatly improves gene prioritization compared to standard hard filtering (e.g. ANNOVAR)

Yandell et al. Genome Research, 2011

[Figure: true disease gene rank; plotted values include 9 and 3]

Average genome-wide rank of the true disease-causing gene for 100 different Mendelian disease genes. Approx. 21,000 genes tested per disease gene. 190 control genomes.

An early estimate of the human mutation rate

• Estimated mutation rate, μ, for hemophilia = 2 x 10^-5 per locus per generation (Haldane, 1935, “The rate of spontaneous mutation of a human gene” J. Genet. 31: 21-30)

• If 1 kb of factor VIII gene is functionally significant, μ = 2 x 10^-8

• Our estimate of μ was 1.1 x 10^-8 (Roach et al., 2010)

• Haldane (1947) estimated that the male mutation rate is up to 10 times higher than that of the female

New Questions

• Direct estimate of paternal age effect on mutation

• Variation in mutation rates

– Variation in DNA repair genes

– Coding vs. various types of non‐coding DNA

– Effects of nearby sequence motifs, indels, etc.

– Population variation

Identical NAA10 mutation found in a second family

Rope et al., 2011, Am. J. Hum. Genet.

ERSA (Huff et al., 2011, Genome Res.) shows the two families are unrelated; the mutation is at least 100 generations old


Identifying a gene responsible for a lethal X-linked disorder

Rope AF, et al. Am J Hum Genet, 2011.

NAA10

X-chromosome-wide significance (p = 0.05/729 = 7 x 10^-5)

Previously undescribed X-linked lethal disorder characterized by aged appearance, severe developmental delay, and cardiac arrhythmias.

Location along the X-chromosome

LOD score needed for genome-wide significance for GATA4

Rare SNPs are much more likely to be population-specific

Common SNPs previously identified in dbSNP

New rarer SNPs identified by sequencing

1000 Genomes Project, 2010

Mutation rate estimation

• 33,937 Mendelian inheritance errors in 1.8 Gb of analyzed sequence in two offspring
– Only a few dozen mutations are expected
– 99.9% of mutation candidates are sequencing errors
• All were resequenced
– Agilent custom capture array
– Illumina GA II sequencing
– High-stringency base calling to minimize false positives
  • Maq
  • Binomial method

Mutation rate estimation

• All but 28 of the Mendelian inheritance errors were excluded; all 28 were confirmed by mass spectrometry
• Estimate the false negative rate by comparing whole-genome sequence with targeted resequencing results
– 5 Mb from 25 randomly selected genomic regions (200 kb each)
• Mutation rate = 1.1 x 10^-8 (6.8 x 10^-9 to 1.7 x 10^-8) /nucleotide/generation: each gamete has ~35 new variants

Other mutation rate estimates

Rate (x 10^-8)   Data                                                 Source
2.5              18 processed pseudogenes (16 kb) in human v. chimp   Nachman and Crowell, 2000, Genetics
1.8              20 coding loci (74 kb)                               Kondrashov, 2003, Hum. Mutat.
1.5              DMD locus (85 kb)                                    Flanigan et al., 2009, Hum. Mutat.
1.3              39 coding loci                                       Lynch, 2010, PNAS
1.4              294 Mb noncoding seq.                                Awadalla et al., 2010, AJHG
1.0; 1.2         Two resequenced trios                                1000 Genomes Project, 2010, Nature

ANNOVAR: Wang et al. 2010 Nucl. Acids Res.

Filtering variants

[Pedigree figure: four generations (1 to 4); legend: - candidate mutation absent, + candidate mutation present (in one or more offspring)]

• First exclude all base pairs:
– No-called in any member of the pedigree
– Polymorphic in any other Complete Genomics genome and any mutation present in the 2nd and 3rd generation
• Next identify all genomic regions IBD between 2nd, 3rd, and 4th generation using Shared Genomic Segments (700 Mb)

Male‐specific mutation rate estimation

Direct estimation of the mutation rate in a family quartet

[Pedigree figure; legend: - candidate mutation absent, + candidate mutation present (in one or more offspring)]

• 11 mutations in 700 Mb of sequence

• Estimated mutation rate: 1.6 x 10^-8 per nucleotide

• Estimated ratio of male‐to‐female mutation rates: 2.5
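The figures on this slide can be reproduced with simple arithmetic. A minimal sketch, assuming that the 700 Mb counts as haploid, paternally transmitted sequence and that the sex-averaged rate (1.1 x 10^-8, from Roach et al. 2010 as cited elsewhere in this talk) is the arithmetic mean of the male and female rates; both assumptions are mine, not stated on the slide.

```python
def per_base_rate(n_mutations: int, n_bases: float) -> float:
    """Point estimate of the per-nucleotide mutation rate."""
    return n_mutations / n_bases

def male_to_female_ratio(male_rate: float, sex_averaged_rate: float) -> float:
    """Ratio implied when sex_averaged_rate = (male_rate + female_rate) / 2 (assumption)."""
    female_rate = 2.0 * sex_averaged_rate - male_rate
    if female_rate <= 0:
        raise ValueError("implied female rate is not positive")
    return male_rate / female_rate

male_rate = per_base_rate(11, 700e6)            # 11 mutations in 700 Mb
ratio = male_to_female_ratio(male_rate, 1.1e-8)

print(f"male-specific rate: {male_rate:.2e} per nucleotide")
print(f"male-to-female ratio: {ratio:.1f}")
```

With only 11 observed mutations the point estimate carries wide sampling uncertainty, so the ratio should be read as approximate.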

Accuracy of mutation rate estimate

[Figure: probability vs. estimated mutation rate (x 10^-8), axis range 0 to 2. Panels: Miller syndrome family; ... and two 1000 Genomes trios; projected: 20 trios]


Whole-genome genetic variation

• One or more family members differed from NCBI reference sequence (Build 36.1) at 4.5 million positions

• Nucleotide diversity (θ), comparing two parents to reference sequence = 9.5 x 10^-4
– Watson and Venter vs. NCBI reference: θ = 9.3 x 10^-4
• 323,255 novel SNPs
– Family analysis showed that about 20% of these apparent SNPs are sequencing errors

Coverage of Family of Four by Region Type

[Figure: percentage of positions (0% to 100%) for Mom, Dad, Dau, and Son in each region type. Legend: uncalled, one allele called, both alleles called. Region types: Exome, Unique, CNVs, Repeats (Int/Old), Repeats (Int/Young), Repeats (Simple/Old), Repeats (Simple/Young)]

1.8 Gb of data useful for mutation detection

Pattern of a de novo mutation

Parents: R/R and R/R; offspring: R/N

R: human reference allele
N: non-reference allele
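The pattern above translates directly into a genotype check. A minimal sketch, assuming genotypes are written as strings such as "R/R" and "R/N"; the function names are illustrative and not taken from the study's pipeline.

```python
def parse_genotype(text: str) -> tuple:
    """Split a genotype string such as 'R/N' into a sorted pair of alleles."""
    alleles = text.strip().split("/")
    if len(alleles) != 2:
        raise ValueError(f"expected two alleles, got {text!r}")
    return tuple(sorted(alleles))

def is_de_novo_candidate(mother: str, father: str, child: str) -> bool:
    """True when both parents are R/R and the child is heterozygous R/N."""
    hom_ref = ("R", "R")
    het = ("N", "R")  # sorted form of R/N
    return (parse_genotype(mother) == hom_ref
            and parse_genotype(father) == hom_ref
            and parse_genotype(child) == het)

# Hypothetical sites: (mother, father, child) genotypes.
sites = [("R/R", "R/R", "R/N"), ("R/N", "R/R", "R/N"), ("R/R", "R/R", "R/R")]
candidates = [site for site in sites if is_de_novo_candidate(*site)]
print(candidates)
```

Sites that pass this check are only candidates: as the mutation rate slides note, the great majority of such Mendelian inheritance errors turn out to be sequencing errors and need resequencing.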

“Scrambled trios”

[Figure: R/R and R/N genotype patterns for the scrambled trios]

2768 sites matched one of these patterns

At 1832 sites, at least one genotype could not be confirmed

False negative rate = 1832/2768 = 66.2%
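The false negative rate feeds directly into the mutation rate estimate. The sketch below is a back-of-envelope reconstruction, not the published calculation: it combines the 28 confirmed mutations and 1.8 Gb of analyzed sequence reported on the mutation rate slides with the rate above, and assumes the 1.8 Gb applies to each of the two offspring, with two haploid genomes per offspring.

```python
def corrected_mutation_rate(n_confirmed: int, analyzed_bases: float,
                            n_offspring: int, false_negative_rate: float) -> float:
    """Per-nucleotide, per-generation rate after correcting for missed mutations."""
    haploid_genomes = 2 * n_offspring  # assumption: two haploid genomes per offspring
    effective_bases = analyzed_bases * haploid_genomes * (1.0 - false_negative_rate)
    return n_confirmed / effective_bases

false_negative_rate = 1832 / 2768   # sites where a genotype could not be confirmed
rate = corrected_mutation_rate(28, 1.8e9, 2, false_negative_rate)

print(f"false negative rate: {false_negative_rate:.1%}")
print(f"corrected rate: {rate:.2e} per nucleotide per generation")
```

Under these assumptions the result lands close to the 1.1 x 10^-8 reported in the talk, which suggests the denominators are at least roughly right.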

Controls

Cases

Variant-based vs. feature-based prioritization

What is the likelihood of observing the composite genotype at Gene X in cases given the controls?

Gene X

C/T: 0.8/0.2; G/C: 0.6/0.4

C/T: 0.1/0.9; G/C: 0.4/0.6; C/T: 0.2/0.8, S -> *

Bonferroni correction for number of features (~21,000 human genes).

Mutation rate for the factor VIII gene (hemophilia A)

• Frequency, f, of males in London with hemophilia: 4 x 10^-5 to 17 x 10^-5 (= q)
• Selection coefficient, s, = 0.75 to 0.90
• Then, µ = sq/3 = 10^-5 to 5 x 10^-5
• Estimated mutation rate, µ, = 2 x 10^-5 per locus per generation

Haldane, JBS, 1935, “The rate of spontaneous mutation of a human gene” J. Genet. 31: 21-30
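Haldane's range can be checked numerically. A short sketch using the values on this slide; the per-base conversion assumes, as stated elsewhere in the talk, that about 1 kb of the factor VIII gene is functionally significant. The factor of 3 reflects that one third of X chromosomes in a population are carried by males, where an X-linked recessive allele is exposed to selection.

```python
def haldane_mu(s: float, q: float) -> float:
    """Mutation-selection balance for an X-linked recessive: mu = s * q / 3."""
    return s * q / 3.0

mu_low = haldane_mu(0.75, 4e-5)    # lower bounds for s and q
mu_high = haldane_mu(0.90, 17e-5)  # upper bounds for s and q

# Per-locus estimate converted to a per-base rate for a 1 kb functional target.
mu_per_locus = 2e-5
functional_bases = 1_000
mu_per_base = mu_per_locus / functional_bases

print(f"mu range: {mu_low:.1e} to {mu_high:.1e} per locus per generation")
print(f"per-base rate: {mu_per_base:.0e}")
```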

High-resolution human recombination map

• Median resolution of crossover sites was 2.6 kb

• 155 crossovers in four meioses
– 98 maternal
– 57 paternal

• 92 crossovers (~60%) located within a HapMap-defined recombination hotspot

Larger chimp-human ancestral population size accommodates earlier coalescence: more generations in which mutations can occur

Ne = 10,000

Ne = 100,000

Chimp Human Chimp Human

Characterizing the mutations

• None were known SNPs (dbSNP, 1000 Genomes)
• Each was seen in only one of two sibs
• Mutations at CpG sites
– Expected = 5.04 of 28 (based on 10x elevation in rate)
– Observed: 5 of 28
• Non-CpG transition/transversion ratio: 2.3

Reconciling the direct mutation estimate with the phylogenetic estimate

• Mutation rate of 2.5 x 10^-8 assumes:
– Human-chimp divergence of 5 MYA
– Small effective population size of human-chimp common ancestor (10,000)
• Assuming a divergence time of 6 MYA and effective size of 100,000, rate = 1.3 x 10^-8
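The sensitivity of the phylogenetic estimate to these assumptions can be illustrated with the standard relationship d = 2μ(t + 2Ne), where d is pairwise sequence divergence, t is the number of generations since the species split, and 2Ne generations is the expected coalescence time in the ancestral population. The divergence value (1.33%) and generation time (20 years) below are assumptions chosen for illustration; they do not appear on the slide.

```python
def phylogenetic_mutation_rate(divergence: float, split_years: float,
                               generation_years: float, ancestral_ne: float) -> float:
    """Solve d = 2 * mu * (t + 2 * Ne) for mu, with t measured in generations."""
    t_generations = split_years / generation_years
    return divergence / (2.0 * (t_generations + 2.0 * ancestral_ne))

DIVERGENCE = 0.0133       # assumed human-chimp divergence per site
GENERATION_YEARS = 20.0   # assumed generation time

rate_small_ne = phylogenetic_mutation_rate(DIVERGENCE, 5e6, GENERATION_YEARS, 10_000)
rate_large_ne = phylogenetic_mutation_rate(DIVERGENCE, 6e6, GENERATION_YEARS, 100_000)

print(f"5 MYA, Ne = 10,000:  {rate_small_ne:.2e}")
print(f"6 MYA, Ne = 100,000: {rate_large_ne:.2e}")
```

A larger ancestral population pushes coalescence further back, adding generations in which mutations can accumulate, so the same divergence implies a lower per-generation rate.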
