High Throughput Sequencing Technologies: On the path to the $0* Genome

Brian Krueger, PhD
Duke University

Center for Human Genome Variation

Chromatin Basics

1) 1400nm - Metaphase chromosome
2) 700nm - Condensed chromosome
3) 300nm - Extended condensed chromosome
4) 30nm - Packed nucleosomes
5) 11nm - Nucleosome string
6) 2nm - DNA double helix

Image credit: Nature Education

• Chromatin is the DNA packing material

• Two forms
– Euchromatin
• Open and actively transcribed
– Heterochromatin
• Packed and not producing RNA

DNA Basics

Credit: Wikimedia Commons

• DNA is made of a sugar-phosphate backbone carrying four bases
– Purines
• Adenine
• Guanine
– Pyrimidines
• Cytosine
• Thymine

• Sequence of bases determines when and what proteins are made

Gene Expression – Enhancers/Promoters


• DNA is converted into useable information in a process called transcription

– Enhancers
• Serve as accessory beacons that bind proteins involved in regulating gene expression
• Help the polymerase “find” where a gene is located in the chromatin
– Promoter
• Located just upstream of the transcription start site
• Staging site for the polymerase (RNA polymerase II) and the transcription factors that create mRNA
– Transcription start site
• First transcribed base of mRNA sequence

Gene Expression – Transcription/Translation


• Genes are composed of exons and introns
– Exons are protein coding regions of DNA
– Introns are noncoding regions of DNA that must be removed during transcription to produce mature mRNA
• Introns removed during transcription by the spliceosome
– Sequence dependent process
• Mature mRNA is capped (methylated) and a poly-adenine tail is added for stability
• Sequence exported to the cytoplasm for translation and protein production

• Mutations to the DNA can negatively affect every step of this process!

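Common DNA Mutations

[Figure: sequence variants and structural variants compared with a reference sequence and a reference chromosome made of segments A B C D]

Sequence variants
Reference: ATCGGGTCATGTCA
Single nucleotide variant: ATCGGGTCATATCA
Small insertion: ATCGGGTCATGACGTCA
Small deletion: ATCGGGTCAT

Structural variants (reference chromosome: A B C D)
Deletion: A C D
Duplication: A B C C D
Inversion: A B D C
Translocation: A B joined to segments (E, F, G) from another chromosome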

Credit: Elizabeth Ruzzo, PhD, CHGV
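The sequence-variant examples on this slide can be told apart with a very small comparison. The sketch below is a toy illustration only: the function name `classify_small_variant` is made up for this example, and it classifies purely by length and mismatch count, which real variant callers do not do.

```python
def classify_small_variant(reference, observed):
    """Label an observed sequence relative to a reference (toy rule: length, then mismatches)."""
    if observed == reference:
        return "no variant"
    if len(observed) > len(reference):
        return "small insertion"
    if len(observed) < len(reference):
        return "small deletion"
    # Same length: count positions where the bases differ
    mismatches = sum(1 for ref_base, obs_base in zip(reference, observed) if ref_base != obs_base)
    if mismatches == 1:
        return "single nucleotide variant"
    return "multiple substitutions"


REFERENCE = "ATCGGGTCATGTCA"  # reference sequence from the slide
for observed in ("ATCGGGTCATATCA", "ATCGGGTCATGACGTCA", "ATCGGGTCAT"):
    print(observed, "->", classify_small_variant(REFERENCE, observed))
```

Structural variants (deletion, duplication, inversion, translocation of whole segments) need alignment and read-pair information and are outside what this sketch covers.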

Common DNA Mutations

• Effects
– No effect
– Too much protein
– Too little protein
– No protein
– Not the right protein

Image Credit: Cooper et al. Nat Rev Genet

• Site of Mutation Matters
– Exons
– RNA splice sites
– Enhancers
– Promoters
– 5’ and 3’ UTR regulatory regions

Splice variant

We’re all mutants! Your genome has 4 million single nucleotide variants and 700,000 insertions/deletions!

Luckily, the genome is 3 billion base pairs and only 2% of those bases code for protein.
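To put those numbers in proportion, here is a quick back-of-the-envelope calculation using only the figures on this slide. The last line assumes variants are spread evenly across the genome, which is a simplification made for this sketch (real variants are not evenly distributed).

```python
GENOME_BASES = 3_000_000_000   # 3 billion base pairs (from the slide)
SNV_COUNT = 4_000_000          # single nucleotide variants per genome (from the slide)
CODING_PERCENT = 2             # share of bases that code for protein (from the slide)

# On average, one single nucleotide variant every N bases
bases_per_snv = GENOME_BASES // SNV_COUNT

# Number of protein coding bases
coding_bases = GENOME_BASES * CODING_PERCENT // 100

# ASSUMPTION: variants spread uniformly, so 2% of them land in coding sequence
snvs_in_coding_if_uniform = SNV_COUNT * CODING_PERCENT // 100

print(bases_per_snv, coding_bases, snvs_in_coding_if_uniform)
```

Under these assumptions there is roughly one single nucleotide variant per 750 bases, and the large majority fall outside protein coding sequence.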

• Mutations/Variations can be detected using DNA sequencing

– First invented in the mid 1970s
– Two very similar methods developed
– Maxam-Gilbert Sequencing
• Chemical modification and cleavage paired with gel electrophoresis
• DNA is 5’ labeled with radioactivity
• Exposed to chemical agents that cause specific DNA breaks
• Run on a gel and the pattern reveals which base is at each site
– Sanger Sequencing
• Dideoxy DNA sequencing paired with gel electrophoresis
• DNA is 5’ labeled with radioactivity
• Small amount of dideoxy base added to 4 separate primer extension reactions
• Run on a gel to determine bases at each position by size

DNA Sequencing

[Figure: Maxam-Gilbert and Sanger sequencing side by side. In Sanger sequencing a dideoxy base has no 3’-OH, so there is no further extension.]
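The logic of reading a Sanger gel can be sketched in a few lines. This is a toy model under stated assumptions: the function names are invented for this example, and it assumes every position is terminated at least once in the matching reaction, so each reaction yields one fragment length per occurrence of its base.

```python
def run_sanger_reactions(synthesized_strand):
    """Return, for each dideoxy base, the lengths of the terminated fragments."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized_strand, start=1):
        # In the reaction spiked with this dideoxy base, some copies stop here
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the gel from the shortest fragment to the longest across all four lanes."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)


lanes = run_sanger_reactions("ATCGGGTCATGTCA")
print(lanes["A"])       # fragment lengths ending in a dideoxy A
print(read_gel(lanes))  # sequence recovered by ordering bands by size
```

Sorting the bands by size is what the gel does physically: the lane a band appears in gives the base, and its position gives where that base sits in the sequence.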


• Sanger sequencing
– Beat Maxam-Gilbert Sequencing as the method of choice
– Became fully automated
• Dideoxy bases replaced with fluorescently labeled dideoxy bases (1 reaction now instead of 4)
• Capillary electrophoresis replaces slab gel electrophoresis

• Lasers and computers replace graduate students and postdocs

• By far the dominant sequencing method up until 2007 – 30 years!

– Still considered the gold standard for validating sequencing data

• Huge limitations for genome wide sequencing because Sanger can only be used to sequence one fragment per Sanger reaction

First Generation Sequencing Technology


• Done using Sanger sequencing…
• Took 10 years to complete
• Cost $3 billion
• Used a technique called hierarchical whole genome shotgun sequencing
– Shotgun sequencing also invented by Frederick Sanger
– Genome fragmented into 200-400kb fragments
– Genome fragments cloned into over 30,000 BAC (bacterial artificial chromosome) clones
– Each clone was then fragmented
– Sanger sequencing performed
– Genome assembled using computers to line up overlapping sequences

• Most human genome sequencing today is done using whole genome shotgun sequencing!

Human Genome Sequencing

Hierarchical Shotgun Sequencing
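The last step, lining up overlapping sequences, can be illustrated with a greedy merge. This is a minimal sketch, not the assembler used for the Human Genome Project: the function names and the `min_len` threshold are assumptions for this example, and it ignores sequencing errors, repeats, and reverse-complement reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left that overlaps
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)] + [merged]
    return contigs


print(greedy_assemble(["GGTCATGT", "ATGTCA", "ATCGGGTC"]))
```

With three overlapping fragments of ATCGGGTCATGTCA the merge recovers the full sequence as a single contig. Repeats longer than a read break this approach, which is one reason read length matters for assembly.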


• Developed to increase throughput beyond what Sanger sequencing allows
• Can sequence many molecules in parallel
– Does not require homogeneous input
– DNA sequenced as clusters or in nanowells
– Single machine can sequence 3-10 billion independent DNA fragments AT THE SAME TIME!
– Single Sanger sequencer maxes out at 1152 reactions per machine

• Time from DNA to genome reduced from 10 years to 1 day!

Second Generation Sequencing

Illumina HiSeq (3-9 billion clusters – 600GB-1.8TB)

Ion Torrent Proton (100-300 million nanowells – 20-60GB)

2nd Gen: Sequencing by Synthesis Overview

[Figure: sequencing by synthesis workflow]
1) Genomic DNA
2) Fragmented DNA
3) Ligate adaptors
4) Bind library and create clusters
5) Sequencing cycle: add bases, wash, image, cleave, wash
6) Repeat hundreds of times on billions of clusters
7) Align reads to a reference genome
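The sequencing cycle can be mimicked in code to show how one image per cycle turns into reads. This is a rough sketch under stated assumptions: the dye colors are hypothetical, the function name is invented for this example, each template is written in the order the polymerase reads it, and chemistry details such as phasing errors are ignored.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# HYPOTHETICAL dye colors, chosen only for illustration
DYE_COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COLOR_TO_BASE = {color: base for base, color in DYE_COLOR.items()}


def sequence_by_synthesis(cluster_templates, cycles):
    """Run add-base / image / cleave cycles and call one base per cluster per cycle."""
    images = []
    for cycle in range(cycles):
        image = []
        for template in cluster_templates:
            incorporated = COMPLEMENT[template[cycle]]  # add one terminated, labeled base
            image.append(DYE_COLOR[incorporated])       # image: record the cluster's color
        images.append(image)                            # cleave dye and terminator, wash
    # Base calling: read each cluster's color across all images
    reads = [
        "".join(COLOR_TO_BASE[image[cluster]] for image in images)
        for cluster in range(len(cluster_templates))
    ]
    return images, reads


images, reads = sequence_by_synthesis(["ATCG", "GGTC"], cycles=4)
print(images[0])  # colors seen in the first cycle, one per cluster
print(reads)      # one read per cluster
```

Every cluster contributes one base per cycle, so read length equals the number of cycles and throughput scales with the number of clusters imaged.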

Mutation Calling/Filtering

Variant calling

Visual Inspection

Cross-checking public databases

Sanger sequencing confirmation

Exome Variant Server: 6,500 exome-sequenced individuals
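The database cross-check step can be pictured as a frequency filter: variants that are common in a reference population are unlikely to explain a rare disease. The sketch below is hypothetical throughout. The variant tuples, the frequency table, and the 1% cutoff are invented for illustration and do not reflect the interface of any real database.

```python
def filter_rare_variants(called_variants, population_frequency, max_frequency=0.01):
    """Keep variants that are absent from, or rare in, a reference population."""
    return [
        variant
        for variant in called_variants
        if population_frequency.get(variant, 0.0) <= max_frequency
    ]


# HYPOTHETICAL calls: (chromosome, position, reference base, alternate base)
called = [
    ("chr1", 1000, "G", "A"),
    ("chr2", 5000, "C", "T"),
    ("chr7", 42, "A", "G"),
]
# HYPOTHETICAL allele frequencies standing in for a public database lookup
frequencies = {
    ("chr1", 1000, "G", "A"): 0.25,
    ("chr2", 5000, "C", "T"): 0.001,
}

candidates = filter_rare_variants(called, frequencies)
print(candidates)
```

Variants that survive the filter would then go on to visual inspection and Sanger confirmation.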

Detecting Copy Number Variants

[Figure: read depth across windows showing a heterozygous deletion, a homozygous deletion, and a duplication]

ERDS (Estimation by Read Depth with SNVs): The average read depth (RD) of every 2-kb window was calculated, followed by GC correction. A paired hidden Markov model was applied to infer the copy number of every window by utilizing both RD information and heterozygosity information.
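The read-depth idea behind copy number calling can be shown with a much simpler rule than ERDS uses. This sketch is not the ERDS algorithm: it skips GC correction, heterozygosity, and the hidden Markov model, and simply scales each window's depth against the median, assuming most windows carry two copies. The function names and depth values are made up for this example.

```python
from statistics import median


def estimate_copy_numbers(window_depths, normal_copies=2):
    """Scale each window's read depth against the median, assumed to represent two copies."""
    baseline = median(window_depths)  # ASSUMPTION: most windows are copy-number normal
    return [round(normal_copies * depth / baseline) for depth in window_depths]


def label_copy_number(copies):
    """Translate a copy number into the event names used on the slide."""
    if copies == 0:
        return "homozygous deletion"
    if copies == 1:
        return "heterozygous deletion"
    if copies == 2:
        return "normal"
    return "duplication"


depths = [30, 31, 29, 15, 0, 45, 30]  # hypothetical average depth per 2-kb window
copies = estimate_copy_numbers(depths)
print(copies)
print([label_copy_number(c) for c in copies])
```

A window at half the typical depth suggests one copy, no reads suggests zero copies, and depth at 1.5x suggests three copies.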

Flavors of Sequencing

• Whole Genome Sequencing
– Obtain whole blood or tissue sample
– Create sequencing libraries of all DNA fragments
• Whole Exome Sequencing
– Utilizes a selection protocol
– Attach complementary RNA or DNA strands to beads
– Fish out ONLY coding DNA sequences
– Create sequencing libraries from enriched DNA
– Reduces cost and analysis time
• Custom Capture
– Same protocol as Exome sequencing
– Only target desired DNA sequences
• Amplicon Sequencing
– Use PCR to amplify target DNA
– Sequence amplified DNA (Amplicon)

Disadvantages of 2nd Generation Tech

• Rely on amplification to create libraries and clusters
– All polymerases have an inherent error rate (10^-6 to 10^-7 per base)
– Errors introduced every 1 million to 10 million bases
– Secondary validation of variants is key
• Short reads cannot be used for de novo genome assembly
– 2nd Generation sequencers have a maximum read length of 400bp
– This is too short to span long repeat regions
– Not good for detecting trinucleotide repeat expansions ex: fragile X, Huntington’s, spinocerebellar ataxias
• Short reads can miss large structural variations
– Genome translocations and inversions likely will be missed
– Require significant read depth at break points for these variations to be detected
• Trouble detecting small insertions and deletions
– Short reads computationally hard to align and call

• Very high quality single molecule long reads would fix many of these problems!
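The polymerase error rate translates into a concrete number of copying mistakes. As a quick worked example, a rate of 10^-6 means one error per million bases copied and 10^-7 means one per ten million. The 3 billion base genome size is taken from earlier in the talk, and the function name is invented for this sketch.

```python
GENOME_BASES = 3_000_000_000  # human genome size used elsewhere in this talk


def expected_copy_errors(bases_copied, bases_per_error):
    """Expected number of polymerase errors when copying a stretch of DNA once."""
    return bases_copied // bases_per_error


errors_at_high_rate = expected_copy_errors(GENOME_BASES, 10**6)  # error rate 10^-6
errors_at_low_rate = expected_copy_errors(GENOME_BASES, 10**7)   # error rate 10^-7
print(errors_at_high_rate, errors_at_low_rate)
```

Copying a whole genome once therefore introduces on the order of hundreds to thousands of errors, and each extra round of amplification adds more, which is why secondary validation matters.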

[Figure: examples of structural rearrangements and repeat expansions (extra copies of segment B) that short reads can miss]

• Defined as single molecule sequencing
• Less complex sample prep and much longer read length (1-100kb) compared to 200-400bp for 2nd Gen
• Two categories
– Sequencing by synthesis
• Pioneered by Pacific Biosciences
• Sequencer uses super microscopes and polymerase bound nanowells to WATCH DNA as it is sequenced in real time
• Nanowells filled with DNA bases
• Fluorescence of base only detected at the polymerase
– Direct sequencing by passing DNA through a nanopore
• Bases fed through a membrane bound nanopore
• Ionic difference between both sides of the membrane
• Detect how ion flow changes at the pore as each base passes through
• Bleeding edge technology
– Many technical hurdles with very high error rates (10-25%)
– Very expensive technology
• Costs 3-10x as much as Illumina to do whole genome sequencing
– Short/Long read hybrid proposed to leverage the base accuracy of 2nd gen sequencing and the length of 3rd gen
• Use long reads as a scaffold and correct the errors with short reads

The Future: Third Generation Sequencing

PacBio

Oxford Nanopore
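The short/long read hybrid idea can be sketched as a majority vote: the long read supplies the scaffold and accurate short reads overrule it wherever they disagree. This is a simplified illustration, not a published correction tool. It assumes the short reads are already aligned to the long read at known offsets and it only handles substitution errors, not the insertions and deletions that also occur in practice. The function name and data are hypothetical.

```python
from collections import Counter


def correct_long_read(long_read, aligned_short_reads):
    """Majority-vote correction; aligned_short_reads holds (offset, sequence) pairs."""
    votes = [Counter() for _ in long_read]
    for offset, read in aligned_short_reads:
        for i, base in enumerate(read):
            position = offset + i
            if 0 <= position < len(long_read):
                votes[position][base] += 1
    corrected = []
    for original_base, counter in zip(long_read, votes):
        if counter:
            corrected.append(counter.most_common(1)[0][0])  # most supported base
        else:
            corrected.append(original_base)  # no short read coverage: keep the long read's base
    return "".join(corrected)


noisy_long_read = "ATCGGGTCTTGTCA"  # hypothetical long read with one wrong base
short_reads = [(0, "ATCGGGTC"), (4, "GGTCATGT"), (6, "TCATGTCA")]
print(correct_long_read(noisy_long_read, short_reads))
```

The long read decides where everything goes, and the short reads decide what each base is.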

Costs Associated with Clinical Sequencing

Item                      Whole Genome      Exome           Custom Capture   Amplicon
Size (GB)                 100               12.5            0.13-1           0.03-0.13
Preparation               $400              $200            $80              $40
Sequencing                $4,300            $400            $12-100          $1-12
Data Processing/Storage   $350              $200            $50              $25
Clinical Review           $5,000-10,000     $2,000-6,000    $700-2,000       $400-900
Total                     $10,000-15,000    $2,800-6,800    $1,000-2,000     $500-1,000
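The totals can be checked by adding up the line items. The sketch below copies the dollar figures from the table as (low, high) ranges. The dictionary layout and function name are choices made for this example.

```python
# (low, high) dollar ranges per line item, copied from the table
COSTS = {
    "Whole Genome": {"Preparation": (400, 400), "Sequencing": (4300, 4300),
                     "Data Processing/Storage": (350, 350), "Clinical Review": (5000, 10000)},
    "Exome": {"Preparation": (200, 200), "Sequencing": (400, 400),
              "Data Processing/Storage": (200, 200), "Clinical Review": (2000, 6000)},
    "Custom Capture": {"Preparation": (80, 80), "Sequencing": (12, 100),
                       "Data Processing/Storage": (50, 50), "Clinical Review": (700, 2000)},
    "Amplicon": {"Preparation": (40, 40), "Sequencing": (1, 12),
                 "Data Processing/Storage": (25, 25), "Clinical Review": (400, 900)},
}


def total_range(line_items):
    """Sum the low ends and the high ends of every line item."""
    low = sum(item_low for item_low, _ in line_items.values())
    high = sum(item_high for _, item_high in line_items.values())
    return low, high


totals = {test_type: total_range(items) for test_type, items in COSTS.items()}
for test_type, (low, high) in totals.items():
    print(f"{test_type}: ${low:,}-{high:,}")
```

The sums land close to the rounded totals in the table, and in every column clinical review is the largest single item.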

DNA sequencing costs are falling, but analysis and clinical review cost will likely remain stable for the foreseeable future

New sequencing technology announced this year should reduce the cost of preparing and sequencing a whole genome to $1000 starting in mid 2014 (Does not include Analysis and Review)

How will we ever get to the $0* genome?!?!

Sequencing Costs in the Genome Era

Image credit: NIH

[Figure: cost per genome over time, annotated with HG Draft, HG Final, Sanger, Sanger – HGP High, and the arrival of Roche/454, Illumina, ABI Solid, and Helicos]

Sequencing in the Genome Era: 2008-2010

• The Dawn of the Second Generation Sequencers
– Roche 454 - 2007
• Imaging based pyrosequencing
• Camera detects pyrophosphate release after each base is added to nanowells – Bright dot = Base present
– ABI Solid - 2007
• Dye tagged fragment ligation
• Imaging based
• Complicated detection scheme using “color space”
– Illumina - 2008
• Imaging based reversible dye termination sequencing
• Camera detects fluorescently labeled bases in each cluster – Color determines base
– Helicos (3rd Gen) - 2009
• First “single molecule” sequencer – Third generation sequencing
• Plagued with problems
• BUT the fear that it might work helped drive down costs

[Images: 454, Illumina GAIIx, ABI Solid 3, Helicos]




• The death of the competition
– Illumina
• Release of the HiSeq
• Drastically increases output 10x over the GAIIx
– Roche 454
• Release 454 Titanium and 454 Junior
• Used primarily for microbes because it can sequence 400bp and do de novo assembly of these small organisms
• Expensive and error prone
• Roche will phase out the 454 family in 2014
– ABI Solid
• Never caught on
• Expensive, error prone, complicated sample prep
– Helicos
• Filed for bankruptcy in 2012

• Costs remain level because Illumina has no competition

Illumina HiSeq 2000




• New Contenders
– Complete Genomics
• Proprietary tech and generate data in-house
• Competitive pricing with Illumina sequencing
– Pacific Biosciences (3rd Gen)
• Announce the PacBio RS
• Promise high base accuracy, single molecule sequencing with reads reaching up to 20kb
– Ion Torrent
• Same sequencing methodology as the Roche 454 system
• Difference is that it detects the release of H+ after bases are added
• Removes need for time consuming imaging steps
• Promise a $1000 genome
– Oxford Nanopore (3rd Gen)
• Announce MinION and GridION
• Promise very cheap single molecule sequencing that can be done on a thumb drive

• Promising competition forces price reductions

PacBio RS

Ion Torrent Proton

Nanopore MinION



Sequencing in the Genome Era: 2012-Present

• New Contenders Fail - Mostly
– Complete Genomics
• Not embraced by the research community and serves the diagnostic niche
– Pacific Biosciences
• Didn’t deliver on promises – 15% error rate, shorter reads (1-10kb)
• Slowly improving – reduced error rate to 5-10%, reads reaching 20-50kb
– Ion Torrent
• Didn’t deliver on promises - Low data output, expensive
• Serves niche diagnostic market where speed is more valuable than cost or amount of data output
• 60GB PII chip has been “coming” since 2012 – Slated for late 2014 release
– Oxford Nanopore
• Finally released first data in 2014
• Full of errors and looks like proof of concept tech
– Illumina
• Release NextSeq500 for the diagnostic market to kill Ion Torrent
• Release the HiSeqX which can sequence a human genome for $1000 to kill Complete Genomics (1.8TB of output in 3 days! – 16 genomes)
– HiSeqX MUST be purchased as a 10 pack ($10 million)
– Contractually forced to ONLY use the HiSeqX for genomes

• Prices remain steady 2012-14 because the competition can’t deliver
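The HiSeqX figures above imply a sequencing depth per genome. As a rough check, the sketch below assumes that "1.8TB" means 1.8 trillion sequenced bases (an interpretation, not something the slide states) and reuses the 3 billion base genome size from earlier in the talk.

```python
RUN_OUTPUT_BASES = 1_800_000_000_000  # ASSUMPTION: "1.8TB" read as 1.8 trillion bases
GENOMES_PER_RUN = 16                  # from the slide
GENOME_BASES = 3_000_000_000          # genome size used earlier in the talk

bases_per_genome = RUN_OUTPUT_BASES // GENOMES_PER_RUN
coverage = bases_per_genome / GENOME_BASES  # average number of times each base is read

print(bases_per_genome, coverage)
```

Under that assumption each genome gets a little over 110 billion bases, or about 37x average coverage.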

Releases $1000 genome sequencer, only lets rich people use it.

Hat image: chasesocal, Deviant Art

The Promise of the $0* Genome

• HiSeqX brings clinical genome cost down to $6-10K (mid 2014)

• Hurdles for the $0* Genome
– *Relies on health insurance companies or governments paying most of the bill
– Clinician Education
• Many clinicians do not understand genetic data or how to use it to affect patient care
– Proof of widely applicable value
• Genome sequences for MOST people not very informative
• Need more population wide data to accurately predict how variants outside of coding regions contribute to disease
• Currently used in cancer, neonatal, fertility and undiagnosed disease diagnostics
– Cost reduction
• Cost of the all-in test needs to be <$5000
• Similar to other high diagnostic value, high tech tests such as PET, CT, and MRI scans
• Likely to happen with streamlined analysis pipelines

Improvements over the next few years will cause more insurance companies to approve payment on whole genome diagnostics
