whole-genome sequencing and the detection of disease

Whole-genome sequencing and the detection of disease-causing mutations Lynn Jorde Department of Human Genetics University of Utah School of Medicine 23 June 2012 Roach et al., 2010, Science

Page 1: Whole-genome sequencing and the detection of disease

Whole-genome sequencing and the detection of disease-causing mutations

Lynn Jorde
Department of Human Genetics
University of Utah School of Medicine
23 June 2012

Roach et al., 2010, Science

Page 2: Whole-genome sequencing and the detection of disease

Overview

• First whole-genome sequencing of a human family
• Disease gene detection
• Mutation rate analysis

• Methods to uncover mutations in whole-genome data
– Ogden syndrome
– Crohn disease

Page 3: Whole-genome sequencing and the detection of disease

Nature, April 1, 2010

Page 4: Whole-genome sequencing and the detection of disease
Page 5: Whole-genome sequencing and the detection of disease

Miller syndrome family (postaxial acrofacial dysostosis)

Craniofacial malformations

Limb malformations

Conductive hearing loss

Page 6: Whole-genome sequencing and the detection of disease

Fineman, 1981, Journal of Pediatrics 98: 87-8

Page 7: Whole-genome sequencing and the detection of disease
Page 8: Whole-genome sequencing and the detection of disease
Page 9: Whole-genome sequencing and the detection of disease

A second Miller syndrome family

Page 10: Whole-genome sequencing and the detection of disease

Miller syndrome family history

Page 11: Whole-genome sequencing and the detection of disease

CFTR genotyping

[Pedigree figure: five family members labeled CFTR-ΔF508.]

Page 12: Whole-genome sequencing and the detection of disease

Exome and whole-genome sequencing identifies two autosomal recessive conditions

• Miller syndrome: DHODH, which encodes dihydroorotate dehydrogenase
– involved in de novo synthesis of pyrimidines

• Primary ciliary dyskinesia: DNAH5, which encodes a dynein heavy chain needed for normal ciliary function
– pulmonary infections, bronchiectasis
– situs inversus

Ng et al., 2010, Nature Genetics
Roach et al., 2010, Science

Page 13: Whole-genome sequencing and the detection of disease

DNA sequencing reveals that both affected children inherited five disease-causing variants

[Pedigree figure, labeled "Compound heterozygotes". One parent: DHODH*, DNAH5*. Other parent: DHODH* (Miller syndrome), DNAH5* (primary ciliary dyskinesia), ΔF508 (cystic fibrosis). Each affected child: DHODH*/DHODH*, DNAH5*/DNAH5*, ΔF508.]

Ng et al., 2010, Nature Genetics
Roach et al., 2010, Science

Page 14: Whole-genome sequencing and the detection of disease

11 March 2010 VOL 328 SCIENCE

Page 15: Whole-genome sequencing and the detection of disease

Salt Lake Tribune, 11 March 2010

Page 16: Whole-genome sequencing and the detection of disease
Page 17: Whole-genome sequencing and the detection of disease

Nature, 6 Oct. 2011

Page 18: Whole-genome sequencing and the detection of disease

• ~3 million single-nucleotide variants, ~10,000 amino acid-changing variants, and ~100 loss-of-function variants in an average individual

• Goal is to identify a tiny subset of these (e.g., de novo mutations or inherited disease-causing mutations)

• This presents a major analytic bottleneck

• Many tools are being used: SIFT, PolyPhen, ANNOVAR, SVA, etc.

The challenge of whole-genome sequences

Page 19: Whole-genome sequencing and the detection of disease

An “all-in-one” tool is needed

Page 20: Whole-genome sequencing and the detection of disease

VAAST: Variant Annotation, Analysis and Selection Tool

• Compares patient sequences with variant frequencies in sequence databases
– SNPs and indels

• Assesses functional impact of amino acid substitutions (BLOSUM scores, OMIM)
– Evidence of purifying selection used

• Analysis done on a gene-by-gene basis (21,000 comparisons instead of 1,000,000 in a GWAS)

• Can evaluate coding or noncoding regions
• Incorporates pedigree data and LOD scores

Yandell, Huff, Hu, Singleton, Moore, Xing, Jorde, Reese, Genome Res. (July, 2011)
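The practical effect of testing genes rather than individual markers shows up in the multiple-testing correction. The sketch below applies a plain Bonferroni correction at alpha = 0.05 to the test counts quoted in this talk (21,000 genes, 1,000,000 GWAS markers, 729 X-chromosome genes). The function name is illustrative and is not part of VAAST.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


ALPHA = 0.05
gene_based = bonferroni_threshold(ALPHA, 21_000)       # one test per gene
marker_based = bonferroni_threshold(ALPHA, 1_000_000)  # one test per GWAS marker
x_chromosome = bonferroni_threshold(ALPHA, 729)        # one test per X-chromosome gene

print(f"gene-based:   {gene_based:.1e}")
print(f"marker-based: {marker_based:.1e}")
print(f"X chromosome: {x_chromosome:.1e}")
```

The gene-based threshold is roughly 50 times less stringent than the marker-based one, which is one reason a gene-level test can reach significance with very few sequenced individuals.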

Page 21: Whole-genome sequencing and the detection of disease

Composite likelihood ratio test

• Null hypothesis: variant does not cause disease

• Alternative hypothesis: variant causes disease

http://www.yandell-lab.org/software/vaast.html (freely available to academic researchers)
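To make the idea of a composite likelihood ratio concrete, here is a deliberately simplified sketch. For each variant site in a gene it compares a null model (cases and controls share one allele frequency) with an alternative model (separate frequencies), and sums the log-likelihood differences across sites. This is an assumption-laden illustration, not the VAAST implementation: it omits the amino acid severity weighting and any permutation-based significance testing, and all counts are made up.

```python
import math


def binom_loglik(k: int, n: int, p: float) -> float:
    """Log-likelihood of k variant alleles among n chromosomes at frequency p.

    The binomial coefficient is omitted because it cancels in the ratio.
    """
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll


def gene_log_lr(sites) -> float:
    """Sum over sites of log L(alternative) - log L(null).

    sites: iterable of (case_alt, case_n, control_alt, control_n) allele counts.
    """
    total = 0.0
    for case_alt, case_n, ctrl_alt, ctrl_n in sites:
        p_null = (case_alt + ctrl_alt) / (case_n + ctrl_n)  # shared frequency
        p_case = case_alt / case_n                          # case-specific
        p_ctrl = ctrl_alt / ctrl_n                          # control-specific
        ll_null = (binom_loglik(case_alt, case_n, p_null)
                   + binom_loglik(ctrl_alt, ctrl_n, p_null))
        ll_alt = (binom_loglik(case_alt, case_n, p_case)
                  + binom_loglik(ctrl_alt, ctrl_n, p_ctrl))
        total += ll_alt - ll_null
    return total


# Hypothetical counts: 2 cases (4 chromosomes) and 190 controls (380 chromosomes).
enriched_gene = [(4, 4, 1, 380), (3, 4, 2, 380)]  # variants rare in controls, common in cases
neutral_gene = [(1, 4, 95, 380)]                  # same frequency in both groups

print(gene_log_lr(enriched_gene))
print(gene_log_lr(neutral_gene))
```

A gene whose variants are equally frequent in cases and controls scores zero; a gene carrying variants that are rare in controls but present in most case chromosomes scores high.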

Page 22: Whole-genome sequencing and the detection of disease

Family data improve disease-gene identification

• Exome analysis (Ng et al. 2010) used four affected individuals from three families to identify DHODH

• Whole-genome analysis of first family, incorporating parental sequences, identified four candidate loci (Roach et al. 2010)

• VAAST identifies just two candidate loci: DNAH5 and DHODH

Page 23: Whole-genome sequencing and the detection of disease

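VAAST analysis of Miller kindred 1 after incorporating parents’ sequences: only two candidate genes

[Figure: per-gene VAAST P values (Pgene) plotted across the genome, with legend categories P < 1e-3, P > 2.4e-6, and P < 2.4e-6; DHODH and DNAH5 are labeled.]

DHODH ranked 1st and DNAH5 ranked 2nd out of 21,902 genes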

Page 24: Whole-genome sequencing and the detection of disease

VAAST analysis of a lethal, X-linked condition

X chromosome exome sequencing of four individuals

Rope et al., Am. J. Hum. Genet. (July 2011)

“Progeria-like” features; cardiac arrhythmia; death before age one year

Page 25: Whole-genome sequencing and the detection of disease

Rope et al., 2011, Am J Hum Genet

VAAST identifies a single candidate locus (analysis time: 20 minutes)

[Figure: VAAST results by location along the X chromosome, with NAA10 labeled and a threshold for X-chromosome-wide significance (p = 0.05/729 = 7 x 10^-5).]

Page 26: Whole-genome sequencing and the detection of disease

Mutation in NAA10 causes a new genetic disease: “Ogden syndrome”

NAA10: N(alpha)-acetyltransferase 10, part of a complex responsible for alpha-acetylation of peptides.

13 family members genotyped for NAA10 mutation.
In vitro analysis: loss-of-function effect.
Genetic test developed.

Page 27: Whole-genome sequencing and the detection of disease

What about more common diseases?
Crohn disease in a Utah kindred

Stephen Guthery, MD, Dept. of Pediatrics

P < 0.001 for observed vs. expected affected individuals

Page 28: Whole-genome sequencing and the detection of disease

[Pedigree figure: affected branches each trace >3 meioses to a founder (b. ~1800s); one individual is marked "?". Legend: Crohn; Psoriasis; Ulcerative colitis; Chronic diarrhea; Crohn + vitiligo + psoriasis.]

Page 29: Whole-genome sequencing and the detection of disease

[Pedigree figure: affected branches each trace >3 meioses to a founder (b. ~1800s); one individual is marked "?". Legend: Crohn; Psoriasis; Ulcerative colitis; Chronic diarrhea; Crohn + vitiligo.]

Exome sequencing of 3 individuals

Page 30: Whole-genome sequencing and the detection of disease

Using VAAST: thousands of candidates narrowed to 93

                                                                   Individual 1   Individual 2   Individual 3
Total number of SNVs                                               50,664         49,724         43,081
Number of variants predicted to cause amino acid changes           10,506         10,175         9,026
Number of shared variants predicted to cause amino acid changes    5302
Number of VAAST candidates (p = 6.5 x 10^-8)                       93

Page 31: Whole-genome sequencing and the detection of disease

Shared segment on Chromosome 5

Two shared segments inherited from common ancestors: 12 cM on chromosome 5 and 14 cM on chromosome 18

Page 32: Whole-genome sequencing and the detection of disease

Using shared genomic segments:  from 93 candidates to a single candidate

• A single missense variant in LRRC70, a gene with substantial sequence identity to toll‐like receptor genes involved in innate immunity

• Amino acid is highly conserved, and LRRC70 appears in GWAS

Page 33: Whole-genome sequencing and the detection of disease

Summary

• Exome and whole‐genome sequencing methods effectively identify disease‐causing mutations

• New analytic methods, such as VAAST, greatly improve our capacity to identify these mutations
– Incorporation of family data: “real genetics”

Page 34: Whole-genome sequencing and the detection of disease

University of Utah: Chad Huff, Jinchuan Xing, Dave Witherspoon, Steve Guthery, Scott Watkins, Mark Yandell, Barry Moore, Hao Hu, Alan Rope, John Carey, John Opitz
Institute for Systems Biology: Jared Roach, David Galas, Lee Hood
Children’s Hospital of Philadelphia: Gholson Lyon, Hakon Hakonarson
University of Washington: Mike Bamshad, Jay Shendure
NIH: Les Biesecker

Acknowledgments

Page 35: Whole-genome sequencing and the detection of disease

Pedigree‐VAAST (pVAAST)

Page 36: Whole-genome sequencing and the detection of disease

Pedigree‐VAAST (pVAAST)

Incorporates inheritance pattern information directly into the likelihood ratio

LODi: LOD score for this gene for the ith family.

• LOD scores calculated for each potential “causal” allele in a family.

Page 37: Whole-genome sequencing and the detection of disease

VAAST 1.0 Cardiac Septal Defect Results (21,000 genes)

[Figure: VAAST 1.0 log p-value by location along the genome, with GATA4 labeled.]

Run time: 24 hours on 42 CPUs

Page 38: Whole-genome sequencing and the detection of disease

pVAAST Cardiac Septal Defect Results (21,000 genes)

[Figure: pVAAST log p-value by location along the genome, with GATA4 labeled and a threshold for genome-wide significance (p = 0.05/21000 = 2.4 x 10^-6).]

Run time: 4 hours on 42 CPUs

Page 39: Whole-genome sequencing and the detection of disease

Male‐specific Mutation Rate Estimation

Page 40: Whole-genome sequencing and the detection of disease

Male‐specific Mutation Rate Estimation

Page 41: Whole-genome sequencing and the detection of disease

Male‐specific Mutation Rate Estimation

Page 42: Whole-genome sequencing and the detection of disease

• Identify IBD regions ( ).
• Relatives provide paternal and offspring genome “replicates” to reduce false positives.
• 12 mutations in 600 Mb of sequence data.
• Estimated male-specific mutation rate of 2.0 x 10^-8 per base pair per generation (compared to sex-averaged rate of 1.1 x 10^-8)
• Male mutation rate is ~5x the female rate.

Male‐specific Mutation Rate Estimation

Page 43: Whole-genome sequencing and the detection of disease

VAAST greatly improves gene prioritization compared to standard hard filtering (e.g. ANNOVAR)

Yandell et al. Genome Research, 2011

[Figure: bar chart with y-axis "True disease gene rank"; bar labels 9 and 3.] Average genome-wide rank of the true disease-causing gene for 100 different Mendelian disease genes. Approx. 21,000 genes tested per disease gene. 190 control genomes.

Page 44: Whole-genome sequencing and the detection of disease

An early estimate of the human mutation rate

• Estimated mutation rate, μ, for hemophilia = 2 x 10^-5 per locus per generation (Haldane, 1935, “The rate of spontaneous mutation of a human gene” J. Genet. 31: 21-30)

• If 1 kb of the factor VIII gene is functionally significant, μ = 2 x 10^-8 per nucleotide

• Our estimate of μ was 1.1 x 10^-8 (Roach et al., 2010)

• Haldane (1947) estimated that the male mutation rate is up to 10 times higher than that of the female

Page 45: Whole-genome sequencing and the detection of disease

New Questions

• Direct estimate of paternal age effect on mutation

• Variation in mutation rates

– Variation in DNA repair genes

– Coding vs. various types of non‐coding DNA

– Effects of nearby sequence motifs, indels, etc.

– Population variation

Page 46: Whole-genome sequencing and the detection of disease

Identical NAA10 mutation found in a second family

Rope et al., 2011, Am. J. Hum. Genet.

ERSA (Huff et al., 2011, Genome Res.) shows the two families are unrelated; the mutation is at least 100 generations old

Page 48: Whole-genome sequencing and the detection of disease

Identifying a gene responsible for a lethal X-linked disorder

Rope AF, et al. Am J Hum Genet, 2011.

Previously undescribed X-linked lethal disorder characterized by aged appearance, severe developmental delay, and cardiac arrhythmias.

[Figure: VAAST results by location along the X chromosome, with NAA10 labeled and a threshold for X-chromosome-wide significance (p = 0.05/729 = 7 x 10^-5).]

Page 49: Whole-genome sequencing and the detection of disease

LOD score needed for genome-wide significance for GATA4

Page 50: Whole-genome sequencing and the detection of disease

Rare SNPs are much more likely to be population-specific

Common SNPs previously identified in dbSNP

New rarer SNPs identified by sequencing

1000 Genomes Project, 2010

Page 51: Whole-genome sequencing and the detection of disease

Mutation rate estimation
• 33,937 Mendelian inheritance errors in 1.8 Gb of analyzed sequence in two offspring
– Only a few dozen mutations are expected
– 99.9% of mutation candidates are sequencing errors
• All were resequenced
– Agilent custom capture array
– Illumina GA II sequencing
– High-stringency base calling to minimize false positives
• Maq
• Binomial method

Page 52: Whole-genome sequencing and the detection of disease

Mutation rate estimation
• All but 28 of the Mendelian inheritance errors were excluded
– all 28 confirmed by mass spectrometry
• Estimate the false negative rate by comparing whole-genome sequence with targeted resequencing results
– 5 Mb from 25 randomly selected genomic regions (200 kb each)
• Mutation rate = 1.1 x 10^-8 (6.8 x 10^-9 to 1.7 x 10^-8) /nucleotide/generation: each gamete has ~35 new variants
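One way to see how the numbers on these slides fit together is the back-of-the-envelope calculation below. It is a sketch, not the published estimation procedure. It assumes that the 1.8 Gb of usable sequence applies to each of the two offspring, that each offspring carries two haploid genomes' worth of new mutations, and that the false negative rate is the 1832/2768 figure from the "scrambled trios" slide. The 3.2 Gb haploid genome size is also an assumption.

```python
confirmed_mutations = 28            # de novo mutations confirmed in the two offspring
inheritance_errors = 33_937         # initial Mendelian inheritance errors
callable_bases = 1.8e9              # bases useful for mutation detection
n_offspring = 2
haploid_genomes_per_offspring = 2   # assumption: one maternal and one paternal gamete
false_negative_rate = 1832 / 2768   # from the scrambled-trio control

# Fraction of initial candidates that turned out not to be real mutations.
error_fraction = 1 - confirmed_mutations / inheritance_errors

# Raw per-nucleotide rate, then corrected for mutations missed by the pipeline.
opportunities = callable_bases * n_offspring * haploid_genomes_per_offspring
raw_rate = confirmed_mutations / opportunities
corrected_rate = raw_rate / (1 - false_negative_rate)

# New variants per gamete, using the reported point estimate.
REPORTED_RATE = 1.1e-8
HAPLOID_GENOME_BP = 3.2e9           # assumption, not stated on the slides
new_variants_per_gamete = REPORTED_RATE * HAPLOID_GENOME_BP

print(f"error fraction among candidates: {error_fraction:.2%}")
print(f"corrected rate: {corrected_rate:.2e} per nucleotide per generation")
print(f"new variants per gamete: {new_variants_per_gamete:.0f}")
```

Under these assumptions the corrected rate comes out close to the reported 1.1 x 10^-8, and the per-gamete count close to the quoted ~35.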

Page 53: Whole-genome sequencing and the detection of disease

Other mutation rate estimates

Rate (x 10^-8)   Data                                                  Source
2.5              18 processed pseudogenes (16 kb) in human v. chimp    Nachman and Crowell, 2000, Genetics
1.8              20 coding loci (74 kb)                                Kondrashov, 2003, Hum. Mutat.
1.5              DMD locus (85 kb)                                     Flanigan et al., 2009, Hum. Mutat.
1.3              39 coding loci                                        Lynch, 2010, PNAS
1.4              294 Mb noncoding seq.                                 Awadalla et al., 2010, AJHG
1.0; 1.2         Two resequenced trios                                 1000 Genomes Project, 2010, Nature

Page 54: Whole-genome sequencing and the detection of disease

ANNOVAR: Wang et al. 2010 Nucl. Acids Res.

Filtering variants

Page 55: Whole-genome sequencing and the detection of disease

Male-specific mutation rate estimation

[Pedigree figure: four generations (1 to 4). "+" marks individuals in whom the candidate mutation is present (in one or more offspring); "-" marks individuals in whom it is absent.]

• First exclude all base pairs:
– No-called in any member of the pedigree
– Polymorphic in any other Complete Genomics genome and any mutation present in the 2nd and 3rd generation
• Next identify all genomic regions IBD between 2nd, 3rd, and 4th generation using Shared Genomic Segments (700 Mb)

Page 56: Whole-genome sequencing and the detection of disease

Direct estimation of the mutation rate in a family quartet

Page 57: Whole-genome sequencing and the detection of disease

Male-specific mutation rate estimation

[Pedigree figure: "+" marks individuals in whom the candidate mutation is present (in one or more offspring); "-" marks individuals in whom it is absent.]

• 11 mutations in 700 Mb of sequence

• Estimated mutation rate: 1.6 x 10^-8 per nucleotide

• Estimated ratio of male-to-female mutation rates: 2.5
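The arithmetic behind these figures can be sketched as follows. The slide does not say how the male-to-female ratio was derived; the version below assumes the sex-averaged rate of 1.1 x 10^-8 reported earlier in the talk is the simple mean of the male and female rates, which happens to reproduce the quoted ratio.

```python
mutations = 11
ibd_bases = 700e6   # sequence shared identical-by-descent through the male line

# Male-specific rate: mutations per nucleotide per generation.
male_rate = mutations / ibd_bases

SEX_AVERAGED_RATE = 1.1e-8   # from the family-quartet estimate

# Assumption: sex-averaged rate = (male_rate + female_rate) / 2.
female_rate = 2 * SEX_AVERAGED_RATE - male_rate
male_to_female = male_rate / female_rate

print(f"male rate:   {male_rate:.2e}")
print(f"female rate: {female_rate:.2e}")
print(f"male-to-female ratio: {male_to_female:.1f}")
```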

Page 58: Whole-genome sequencing and the detection of disease

Accuracy of mutation rate estimate

[Figure: four panels plotting probability against estimated mutation rate (x 10^-8, axis from 0 to 2). Panel titles: "Miller Syndrome family"; "... and two 1000 Genomes trios"; "Projected: 20 trios"; "Projected: 20 trios?"]

Page 59: Whole-genome sequencing and the detection of disease
Page 60: Whole-genome sequencing and the detection of disease

Mutation rate estimation
• All but 28 of the Mendelian inheritance errors were excluded
– all 28 confirmed by mass spectrometry
• Estimate the false negative rate by comparing whole-genome sequence with targeted resequencing results
– 5 Mb from 25 randomly selected genomic regions (200 kb each)
• Mutation rate = 1.1 x 10^-8 /nucleotide/generation: each gamete has 35 new variants

Page 61: Whole-genome sequencing and the detection of disease

Whole-genome genetic variation

• One or more family members differed from NCBI reference sequence (Build 36.1) at 4.5 million positions

• Nucleotide diversity (θ), comparing two parents to reference sequence = 9.5 x 10^-4

– Watson and Venter vs. NCBI reference: θ = 9.3 x 10^-4

• 323,255 novel SNPs
– Family analysis showed that about 20% of these apparent SNPs are sequencing errors

Page 62: Whole-genome sequencing and the detection of disease

Coverage of Family of Four by Region Type

[Figure: stacked bars (0% to 100%) for Mom, Dad, Dau, and Son showing the fraction of positions uncalled, with one allele called, and with both alleles called, for each region type: Exome, Unique, CNVs, Repeats (Int/Old), Repeats (Int/Young), Repeats (Simple/Old), Repeats (Simple/Young).]

1.8 Gb of data useful for mutation detection

Page 63: Whole-genome sequencing and the detection of disease

Pattern of a de novo mutation

[Figure: two trios in which both parents are R/R and the child is R/N.]

R: human reference allele
N: non-reference allele
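The pattern can be expressed as a simple genotype check. The sketch below uses the slide's R/N notation; the function name and the example calls are hypothetical, and a real pipeline would also need to handle no-calls and sequencing error.

```python
HOM_REF = ("R", "R")
HET = ("N", "R")


def _alleles(genotype: str) -> tuple:
    """Normalize a genotype string such as 'R/N' to a sorted allele tuple."""
    return tuple(sorted(genotype.split("/")))


def is_de_novo_pattern(mother: str, father: str, child: str) -> bool:
    """True when both parents are homozygous reference and the child is heterozygous."""
    return (_alleles(mother) == HOM_REF
            and _alleles(father) == HOM_REF
            and _alleles(child) == HET)


# Hypothetical trio calls: (mother, father, child)
calls = [
    ("R/R", "R/R", "R/N"),  # candidate de novo mutation
    ("R/N", "R/R", "R/N"),  # inherited from the mother
    ("R/R", "R/R", "R/R"),  # no variant
    ("R/R", "R/R", "N/R"),  # same as the first, alleles written in the other order
]
flags = [is_de_novo_pattern(*c) for c in calls]
print(flags)
```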

Page 64: Whole-genome sequencing and the detection of disease

“Scrambled trios”

[Figure: several trio diagrams with R/R and R/N genotype labels.]

2768 sites matched one of these patterns

At 1832 sites, at least one genotype could not be confirmed

False negative rate = 1832/2768 = 66.2%

Page 65: Whole-genome sequencing and the detection of disease

Variant-based vs. feature-based prioritization

What is the likelihood of observing the composite genotype at Gene X in cases given the controls?

[Figure: allele frequencies at Gene X.]
Controls: C/T: 0.8/0.2; G/C: 0.6/0.4
Cases: C/T: 0.1/0.9; G/C: 0.4/0.6; C/T: 0.2/0.8, S -> *

Bonferroni correction for number of features (~21,000 human genes).

Page 66: Whole-genome sequencing and the detection of disease

Mutation rate for the factor VIII gene (hemophilia A)

• Frequency, f, of males in London with hemophilia: 4 x 10^-5 to 17 x 10^-5 (= q)

• Selection coefficient, s, = 0.75 to 0.90
• Then, µ = sq/3 = 10^-5 to 5 x 10^-5

• Estimated mutation rate, µ, = 2 x 10^-5 per locus per generation

Haldane, JBS, 1935, “The rate of spontaneous mutation of a human gene” J. Genet. 31: 21-30
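Haldane's range can be reproduced directly from the values on this slide. The sketch below evaluates µ = sq/3 at the two ends of the quoted ranges; the factor of 3 reflects that, for an X-linked gene, one third of the X chromosomes in the population are carried by males. It also converts the per-locus estimate to a per-nucleotide one using the 1 kb functional target mentioned elsewhere in the talk. Variable names are illustrative.

```python
def haldane_mu(s: float, q: float) -> float:
    """Equilibrium mutation rate for an X-linked recessive trait: mu = s * q / 3."""
    return s * q / 3


mu_low = haldane_mu(s=0.75, q=4e-5)    # low end of both ranges
mu_high = haldane_mu(s=0.90, q=17e-5)  # high end of both ranges

# Per-nucleotide rate if 1 kb of the gene is functionally significant.
MU_PER_LOCUS = 2e-5
FUNCTIONAL_TARGET_BP = 1_000
mu_per_nucleotide = MU_PER_LOCUS / FUNCTIONAL_TARGET_BP

print(f"mu range: {mu_low:.1e} to {mu_high:.1e} per locus per generation")
print(f"per nucleotide: {mu_per_nucleotide:.1e}")
```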

Page 67: Whole-genome sequencing and the detection of disease

High-resolution human recombination map

• Median resolution of crossover sites was 2.6 kb

• 155 crossovers in four meioses
– 98 maternal
– 57 paternal

• 92 crossovers (~60%) located within a HapMap-defined recombination hotspot

Page 68: Whole-genome sequencing and the detection of disease

Larger chimp-human ancestral population size accommodates earlier coalescence: more generations in which mutations can occur

[Figure: two chimp-human genealogies, one with ancestral Ne = 10,000 and one with Ne = 100,000.]

Page 69: Whole-genome sequencing and the detection of disease

Characterizing the mutations
• None were known SNPs (dbSNP, 1000 Genomes)
• Each was seen in only one of two sibs
• Mutations at CpG sites
– Expected = 5.04 of 28 (based on 10x elevation in rate)
– Observed: 5 of 28
• Non-CpG transition/transversion ratio: 2.3

Page 70: Whole-genome sequencing and the detection of disease

Reconciling the direct mutation estimate with the phylogenetic estimate

• Mutation rate of 2.5 x 10^-8 assumes:
– Human-chimp divergence of 5 MYA
– Small effective population size of human-chimp common ancestor (10,000)
• Assuming a divergence time of 6 MYA and effective size of 100,000, rate = 1.3 x 10^-8
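The sensitivity of the phylogenetic estimate to these assumptions can be sketched with the standard neutral-divergence expression µ = k / (2t + 4Ne), where k is the per-site sequence divergence, t is the number of generations since the split, and Ne is the ancestral effective population size. The divergence value (1.33%) and the 20-year generation time below are assumptions that do not appear on the slides; they were chosen because they reproduce the two rates quoted above.

```python
def phylogenetic_mu(divergence: float, split_years: float,
                    generation_years: float, ancestral_ne: float) -> float:
    """Per-generation mutation rate implied by neutral divergence: k / (2t + 4Ne)."""
    t_generations = split_years / generation_years
    return divergence / (2 * t_generations + 4 * ancestral_ne)


DIVERGENCE = 0.0133      # assumption: per-site human-chimp divergence
GENERATION_YEARS = 20    # assumption: average generation time

mu_recent_small = phylogenetic_mu(DIVERGENCE, 5e6, GENERATION_YEARS, 10_000)
mu_older_large = phylogenetic_mu(DIVERGENCE, 6e6, GENERATION_YEARS, 100_000)

print(f"5 MYA, Ne = 10,000:  {mu_recent_small:.2e}")
print(f"6 MYA, Ne = 100,000: {mu_older_large:.2e}")
```

An older split and a larger ancestral population both add generations over which the same divergence could accumulate, which lowers the implied per-generation rate toward the direct family-based estimate.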