High Throughput Sequencing Technologies: On the path to the $0* Genome
DESCRIPTION
Presented to freshmen at Duke University on April 7, 2014. Includes detailed slide notes that loosely follow what I said in the lecture.
TRANSCRIPT
High Throughput Sequencing Technologies: On the path to the $0* Genome
Brian Krueger, PhD
Duke University
Center for Human Genome Variation
Chromatin Basics
1) 1400nm - Metaphase chromosome
2) 700nm - Condensed chromosome
3) 300nm - Extended condensed chromosome
4) 30nm - Packed nucleosomes
5) 11nm - Nucleosome string
6) 2nm - DNA double helix
Image credit: Nature Education
• Chromatin is the DNA packing material
• Two forms
  – Euchromatin
    • Open and actively transcribed
  – Heterochromatin
    • Packed and not producing RNA
DNA Basics
Credit: Wikimedia Commons
• DNA is made of a sugar-phosphate backbone and four bases
  – Purines
    • Adenine
    • Guanine
  – Pyrimidines
    • Cytosine
    • Thymine
• Sequence of bases determines when and what proteins are made
Gene Expression – Enhancers/Promoters
Image credit: Nature Education
• DNA is converted into usable information in a process called transcription
  – Enhancers
    • Serve as accessory beacons that bind proteins involved in regulating gene expression
    • Help the polymerase “find” where a gene is located in the chromatin
  – Promoter
    • Located just upstream of the transcription start site
    • Staging site for the polymerase (RNA polymerase II) and the transcription factors that create mRNA
  – Transcription start site
    • First transcribed base of the mRNA sequence
Gene Expression – Transcription/Translation
Image credit: Nature Education
• Genes are composed of exons and introns
  – Exons are protein coding regions of DNA
  – Introns are noncoding regions of DNA that must be removed during transcription to produce mature mRNA
• Introns are removed during transcription by the spliceosome
  – Sequence dependent process
• Mature mRNA is capped (methylated) and a poly-adenine tail is added for stability
• Sequence exported to the cytoplasm for translation and protein production
• Mutations to the DNA can negatively affect every step of this process!
Common DNA Mutations
Sequence variants
  Reference                  ATCGGGTCATGTCA
  Single nucleotide variant  ATCGGGTCATATCA
  Small insertion            ATCGGGTCATGACGTCA
  Small deletion             ATCGGGTCAT
Structural variants (chromosome segments)
  Reference      A B C D
  Deletion       A C D
  Duplication    A B C C D
  Inversion      A B D C
  Translocation  A B G E
Credit: Elizabeth Ruzzo, PhD, CHGV
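The sequence variant examples in the figure can be told apart with a few lines of code. This is only a minimal sketch: the function name `classify_small_variant` is made up for illustration, and it uses nothing more than sequence length and a mismatch count, which is far simpler than what a real variant caller does.

```python
def classify_small_variant(reference: str, sample: str) -> str:
    """Label a sample sequence relative to a reference by length and mismatches."""
    if sample == reference:
        return "no variant"
    if len(sample) > len(reference):
        return "small insertion"
    if len(sample) < len(reference):
        return "small deletion"
    # Same length: count the positions where the two sequences disagree.
    mismatches = sum(1 for r, s in zip(reference, sample) if r != s)
    return "single nucleotide variant" if mismatches == 1 else "multiple substitutions"


# Sequences copied from the figure.
reference = "ATCGGGTCATGTCA"
for sample in ("ATCGGGTCATATCA", "ATCGGGTCATGACGTCA", "ATCGGGTCAT"):
    print(sample, "->", classify_small_variant(reference, sample))
```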
Common DNA Mutations
• Effects
  – No effect
  – Too much protein
  – Too little protein
  – No protein
  – Not the right protein
Image Credit: Cooper et al. Nat Rev Genet
• Site of Mutation Matters
  – Exons
  – RNA splice sites
  – Enhancers
  – Promoters
  – 5’ and 3’ UTR regulatory regions
Splice variant
We’re all mutants! Your genome has 4 million single nucleotide variants and 700,000 insertions/deletions!
Luckily, the genome is 3 billion base pairs and only 2% of those bases code for protein
• Mutations/Variations can be detected using DNA sequencing
  – First invented in the mid 1970s
  – Two very similar methods developed
  – Maxam-Gilbert Sequencing
    • Chemical modification and cleavage paired with gel electrophoresis
    • DNA is 5’ labeled with radioactivity
    • Exposed to chemical agents that cause specific DNA breaks
    • Run on a gel and the pattern reveals which base is at each site
  – Sanger Sequencing
    • Dideoxy DNA sequencing paired with gel electrophoresis
    • DNA is 5’ labeled with radioactivity
    • Small amount of dideoxy base added to 4 separate primer extension reactions
    • Run on a gel to determine bases at each position by size
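To see why running the four dideoxy reactions on a gel reveals the sequence, here is a toy simulation. It is a sketch under simplifying assumptions: every possible termination product is present, fragment sizes are measured exactly, and the function names (`sanger_fragments`, `read_gel`) are invented for this example.

```python
def sanger_fragments(synthesized):
    """For each dideoxy reaction, list the fragment lengths that end in that base."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(synthesized, start=1):
        # A dideoxy version of this base stops extension right here,
        # leaving a fragment that is `position` bases long.
        lanes[base].append(position)
    return lanes


def read_gel(lanes):
    """Read the sequence by ordering all fragments from shortest to longest."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)


lanes = sanger_fragments("ATCGGGTCATGTCA")
print(lanes["G"])       # fragment lengths seen in the ddG lane
print(read_gel(lanes))  # sequence reconstructed from fragment sizes alone
```

Each lane only tells you where one base occurs; sorting the bands from all four lanes by size puts the bases back in order.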
DNA Sequencing
Maxam-Gilbert | Sanger
No 3’-OH, no extension!
Credit: Wikimedia Commons
• Sanger sequencing
  – Beat Maxam-Gilbert sequencing as the method of choice
  – Became fully automated
    • Dideoxy bases replaced with fluorescently labeled dideoxy bases (1 reaction now instead of 4)
    • Capillary electrophoresis replaces slab gel electrophoresis
    • Lasers and computers replace graduate students and postdocs
• By far the dominant sequencing method up until 2007 – 30 years!
  – Still considered the gold standard for validating sequencing data
• Huge limitation for genome wide sequencing: each Sanger reaction can sequence only one fragment
First Generation Sequencing Technology
Credit: Wikimedia Commons
• Done using Sanger sequencing…
• Took 10 years to complete
• Cost $3 billion
• Used a technique called hierarchical whole genome shotgun sequencing
  – Shotgun sequencing also invented by Frederick Sanger
  – Genome fragmented into 200-400kb fragments
  – Genome fragments cloned into over 30,000 BAC (bacterial artificial chromosome) libraries
  – Libraries were then fragmented
  – Sanger sequencing performed
  – Genome assembled using computers to line up overlapping sequences
• Most human genome sequencing today is done using whole genome shotgun sequencing!
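The last step of shotgun sequencing, lining up overlapping sequences, can be sketched with a greedy merge. This is a toy illustration and not the assembler used by the Human Genome Project: it assumes error-free reads from one strand, and the names `overlap` and `greedy_assemble` and the `min_len` cutoff are made up for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # nothing overlaps any more; leave the remaining contigs unmerged
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return max(contigs, key=len)


reads = ["ATCGGGTC", "GGTCATGT", "ATGTCA"]
print(greedy_assemble(reads))
```

Greedy merging breaks down when a repeat is longer than the reads, which is one reason the hierarchical approach first split the genome into large mapped fragments.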
Human Genome Sequencing
Hierarchical Shotgun Sequencing
Credit: Wikimedia Commons
• Developed to increase throughput of Sanger sequencing
• Can sequence many molecules in parallel
  – Does not require homogeneous input
  – DNA sequenced as clusters or in nanowells
  – Single machine can sequence 3-10 billion independent DNA fragments AT THE SAME TIME!
  – Single Sanger sequencer maxes out at 1152 reactions per machine
• Time from DNA to genome reduced from 10 years to 1 day!
Second Generation Sequencing
Illumina HiSeq (3-9 billion clusters – 600GB-1.8TB)
Ion Torrent Proton (100-300 million nanowells – 20-60GB)
2nd Gen: Sequencing by Synthesis Overview
Genomic DNA → Fragmented DNA → Ligate adaptors → Bind library and create clusters
Sequencing cycle: Add bases → Wash → Image → Cleave → Wash
Repeat hundreds of times on billions of clusters
Align reads to a reference genome
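The final step, aligning reads to a reference genome, can be illustrated with a brute-force search. This is only a sketch: it slides each read along the reference without gaps and keeps the position with the fewest mismatches, whereas real aligners use indexed data structures and allow insertions and deletions. The function name `align_read` is hypothetical.

```python
def align_read(reference, read):
    """Return (position, mismatches) for the best ungapped placement of the read."""
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for r, q in zip(window, read) if r != q)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


reference = "ATCGGGTCATGTCA"
for read in ("GGGTCATG", "TCATATCA"):
    print(read, align_read(reference, read))
```

A read that places with a single mismatch, like the second one here, is the kind of evidence a variant caller then weighs across many overlapping reads.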
Mutation Calling/Filtering
Variant calling
Visual Inspection
Cross-checking public databases
Sanger sequencing confirmation
Exome Variant Server: 6,500 exome sequenced individuals
Detecting Copy Number Variants
heterozygous deletion
homozygous deletion
duplication
Windows
ERDS (Estimation by Read Depth with SNVs): The average read depth (RD) of every 2-kb window was calculated, followed by GC correction. A paired hidden Markov model was then applied to infer the copy number of every window from both RD and heterozygosity information.
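The read depth idea behind this can be sketched in a few lines. Note that this is NOT the ERDS method: it skips GC correction and heterozygosity, and it swaps the paired hidden Markov model for a plain ratio against the median window depth. The function names and the simulated read positions are assumptions made for illustration.

```python
from statistics import median


def window_depths(read_starts, genome_length, window=2000):
    """Count reads starting in each fixed-size window (a stand-in for read depth)."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for start in read_starts:
        counts[start // window] += 1
    return counts


def call_copy_number(counts, ploidy=2):
    """Scale each window by the median depth and round to an integer copy number."""
    baseline = median(counts)
    return [round(ploidy * c / baseline) for c in counts]


# Hypothetical read start positions over a 10-kb region: windows 0, 1 and 4 have
# normal depth, window 2 has half depth, and window 3 has 1.5x depth.
starts = (
    list(range(0, 2000, 20))
    + list(range(2000, 4000, 20))
    + list(range(4000, 6000, 40))
    + list(range(6000, 8000, 20)) + list(range(6000, 8000, 40))
    + list(range(8000, 10000, 20))
)
counts = window_depths(starts, genome_length=10000)
print(counts)
print(call_copy_number(counts))
```

In this toy region a copy number of 1 corresponds to a heterozygous deletion, 0 would be a homozygous deletion, and 3 a duplication.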
Flavors of Sequencing
• Whole Genome Sequencing
  – Obtain whole blood or tissue sample
  – Create sequencing libraries of all DNA fragments
• Whole Exome Sequencing
  – Utilizes a selection protocol
  – Attach complementary RNA or DNA strands to beads
  – Fish out ONLY coding DNA sequences
  – Create sequencing libraries from enriched DNA
  – Reduces cost and analysis time
• Custom Capture
  – Same protocol as exome sequencing
  – Only target desired DNA sequences
• Amplicon Sequencing
  – Use PCR to amplify target DNA
  – Sequence amplified DNA (amplicon)
Disadvantages of 2nd Generation Tech
• Rely on amplification to create libraries and clusters
  – All polymerases have an inherent error rate (10^-6 to 10^-7)
  – Errors introduced every 1 million to 10 million bases
  – Secondary validation of variants is key
• Short reads cannot be used for de novo genome assembly
  – 2nd Generation sequencers have a maximum read length of 400bp
  – This is too short to span long repeat regions
  – Not good for detecting trinucleotide repeat expansions, ex: fragile X, Huntington’s, spinocerebellar ataxias
• Short reads can miss large structural variations
  – Genome translocations and inversions likely will be missed
  – Require significant read depth at break points for these variations to be detected
• Trouble detecting small insertions and deletions
  – Short reads computationally hard to align and call
• Very high quality single molecule long reads would fix many of these problems!
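To put the polymerase error rate in perspective, here is the arithmetic for a 3 billion base pair genome. It is a back-of-the-envelope sketch that assumes a constant per-base error rate and a single round of copying; the function name `expected_errors` is made up for the example.

```python
GENOME_SIZE = 3_000_000_000  # base pairs, the figure used elsewhere in these slides


def expected_errors(error_rate, bases=GENOME_SIZE):
    """Expected number of polymerase errors when copying `bases` bases once."""
    return round(bases * error_rate)


for error_rate in (1e-6, 1e-7):
    print(f"rate {error_rate:g}: one error per {round(1 / error_rate):,} bases, "
          f"about {expected_errors(error_rate):,} errors per copy of the genome")
```

Even at these low rates a single copy of the genome picks up hundreds to thousands of errors, which is why secondary validation of variants matters.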
• Defined as single molecule sequencing
• Less complex sample prep and much longer read length (1-100kb) compared to 200-400bp for 2nd Gen
• Two categories
  – Sequencing by synthesis
    • Pioneered by Pacific Biosciences
    • Sequencer uses super microscopes and polymerase bound nanowells to WATCH DNA as it is sequenced in real time
    • Nanowells filled with DNA bases
    • Fluorescence of base only detected at the polymerase
  – Direct sequencing by passing DNA through a nanopore
    • Bases fed through a membrane bound nanopore
    • Ionic difference between both sides of the membrane
    • Detect how ion flow changes at the pore as each base passes through
• Bleeding edge technology
  – Many technical hurdles with very high error rates (10-25%)
  – Very expensive technology
    • Costs 3-10x as much as Illumina to do whole genome sequencing
  – Short/long read hybrid proposed to leverage the base accuracy of 2nd gen sequencing and the length of 3rd gen
    • Use long reads as a scaffold and correct the errors with short reads
The Future: Third Generation Sequencing
PacBio
Oxford Nanopore
Costs Associated with Clinical Sequencing
                         Whole Genome     Exome          Custom Capture  Amplicon
Size (GB)                100              12.5           0.13-1          0.03-0.13
Preparation              $400             $200           $80             $40
Sequencing               $4,300           $400           $12-100         $1-12
Data Processing/Storage  $350             $200           $50             $25
Clinical Review          $5,000-10,000    $2,000-6,000   $700-2,000      $400-900
Total                    $10,000-15,000   $2,800-6,800   $1,000-2,000    $500-1,000
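The totals in the table are rounded sums of the rows above them. The sketch below adds up the low and high end of each column; the dollar figures are copied from the table, while the dictionary layout and the name `total_range` are just one way to organize them.

```python
# (low, high) cost in US dollars for each step, copied from the table.
costs = {
    "Whole Genome": {
        "Preparation": (400, 400), "Sequencing": (4300, 4300),
        "Data Processing/Storage": (350, 350), "Clinical Review": (5000, 10000),
    },
    "Exome": {
        "Preparation": (200, 200), "Sequencing": (400, 400),
        "Data Processing/Storage": (200, 200), "Clinical Review": (2000, 6000),
    },
    "Custom Capture": {
        "Preparation": (80, 80), "Sequencing": (12, 100),
        "Data Processing/Storage": (50, 50), "Clinical Review": (700, 2000),
    },
    "Amplicon": {
        "Preparation": (40, 40), "Sequencing": (1, 12),
        "Data Processing/Storage": (25, 25), "Clinical Review": (400, 900),
    },
}


def total_range(steps):
    """Sum the low ends and the high ends of every step."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


for test, steps in costs.items():
    low, high = total_range(steps)
    review_low, _ = steps["Clinical Review"]
    print(f"{test}: ${low:,}-{high:,} (clinical review is at least {review_low / low:.0%} of the low end)")
```

Clinical review makes up a large share of every column, including the whole genome, where sequencing itself is the most expensive.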
DNA sequencing costs are falling, but analysis and clinical review cost will likely remain stable for the foreseeable future
New sequencing technology announced this year should reduce the cost of preparing and sequencing a whole genome to $1000 starting in mid 2014 (Does not include Analysis and Review)
How will we ever get to the $0* genome?!?!
Sequencing Costs in the Genome Era
Image credit: NIH
Chart annotations: HG Draft, HG Final; Sanger; Sanger – HGP High; Roche/454, Illumina, ABI Solid, Helicos
Sequencing in the Genome Era: 2008-2010
• The Dawn of the Second Generation Sequencers
  – Roche 454 - 2007
    • Imaging based pyrosequencing
    • Camera detects pyrophosphate release after each base is added to nanowells (bright dot = base present)
  – ABI Solid - 2007
    • Dye tagged fragment ligation
    • Imaging based
    • Complicated detection scheme using “color space”
  – Illumina - 2008
    • Imaging based reversible dye termination sequencing
    • Camera detects fluorescently labeled bases in each cluster (color determines base)
  – Helicos (3rd Gen) - 2009
    • First “single molecule” sequencer (third generation sequencing)
    • Plagued with problems
    • BUT the fear that it might work helped drive down costs
454
Illumina GAIIx
ABI Solid 3
Helicos
Sequencing in the Genome Era: 2010-2011
• The death of the competition
  – Illumina
    • Release of the HiSeq
    • Drastically increases output 10x over the GAIIx
  – Roche 454
    • Release 454 Titanium and 454 Junior
    • Used primarily for microbes because it can sequence 400bp and do de novo assembly of these small organisms
    • Expensive and error prone
    • Roche will phase out the 454 family in 2014
  – ABI Solid
    • Never caught on
    • Expensive, error prone, complicated sample prep
  – Helicos
    • Filed for bankruptcy 2011
• Costs remain level because Illumina has no competition
Illumina HiSeq 2000
Sequencing in the Genome Era: 2010-2011
• New Contenders
  – Complete Genomics
    • Proprietary tech and generate data in-house
    • Competitive pricing with Illumina sequencing
  – Pacific Biosciences (3rd Gen)
    • Announce the PacBio RS
    • Promise high base accuracy, single molecule sequencing with reads reaching up to 20kb
  – Ion Torrent
    • Same sequencing methodology as the Roche 454 system
    • Difference is that it detects the release of H+ after bases are added
    • Removes need for time consuming imaging steps
    • Promise a $1000 genome
  – Oxford Nanopore (3rd Gen)
    • Announce MinION and GridION
    • Promise very cheap single molecule sequencing that can be done on a thumb drive
• Promising competition forces price reductions
PacBio RS
Ion Torrent Proton
Nanopore MinION
Sequencing in the Genome Era: 2012-Present
• New Contenders Fail - Mostly
  – Complete Genomics
    • Not embraced by the research community and serves the diagnostic niche
  – Pacific Biosciences
    • Didn’t deliver on promises: 15% error rate, shorter reads (1-10kb)
    • Slowly improving: reduced error rate to 5-10%, reads reaching 20-50kb
  – Ion Torrent
    • Didn’t deliver on promises: low data output, expensive
    • Serves niche diagnostic market where speed is more valuable than cost or amount of data output
    • 60GB PII chip has been “coming” since 2012, slated for late 2014 release
  – Oxford Nanopore
    • Finally released first data in 2014
    • Full of errors and looks like proof of concept tech
  – Illumina
    • Release NextSeq500 for the diagnostic market to kill Ion Torrent
    • Release the HiSeqX which can sequence a human genome for $1000 to kill Complete Genomics (1.8TB of output in 3 days! – 16 genomes)
      – HiSeqX MUST be purchased as a 10 pack ($10 million)
      – Contractually forced to ONLY use the HiSeqX for genomes
• Prices remain steady 2012-14 because the competition can’t deliver
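The HiSeqX numbers can be turned into an average coverage per genome. This sketch assumes that “1.8TB” means 1.8 trillion sequenced bases and that every base lands on the 3 billion base pair genome; both are simplifications, and the function name `mean_coverage` is made up.

```python
def mean_coverage(total_bases, genomes, genome_size=3_000_000_000):
    """Average number of times each base of each genome is sequenced."""
    return total_bases / genomes / genome_size


run_output = 1_800_000_000_000  # assumed: 1.8TB of output = 1.8 trillion bases
print(f"{mean_coverage(run_output, genomes=16):.1f}x coverage per genome")
```

Under those assumptions a 3 day run gives each of the 16 genomes a little under 40x coverage.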
Releases $1000 genome sequencer, only lets rich people use it.
Hat image: chasesocal, Deviant Art
The Promise of the $0* Genome
• HiSeqX brings clinical genome cost down to $6-10K (mid 2014)
• Hurdles for the $0* Genome
  – *Relies on health insurance companies or governments paying most of the bill
  – Clinician Education
    • Many clinicians do not understand genetic data or how to use it to affect patient care
  – Proof of widely applicable value
    • Genome sequences for MOST people not very informative
    • Need more population wide data to accurately predict how variants outside of coding regions contribute to disease
    • Currently used in cancer, neonatal, fertility and undiagnosed disease diagnostics
  – Cost reduction
    • Cost of the all-in test needs to be <$5000
    • Similar to other high diagnostic value, high tech tests such as PET, CT, and MRI scans
    • Likely to happen with streamlined analysis pipelines
Improvements over the next few years will cause more insurance companies to approve payment on whole genome diagnostics