Next-generation sequence analysis. Gabor T. Marth, Boston College Biology Department. PSB 2008, January 4-8, 2008


Page 1:

Next-generation sequence analysis

Gabor T. Marth
Boston College Biology Department

PSB 2008
January 4-8, 2008

Page 2:

Read length and throughput

[Figure: bases per machine run (1 Mb to 1 Gb) plotted against read length (10 bp to 1,000 bp)]

• ABI capillary sequencer

• 454 pyrosequencer (20-100 Mb in 100-250 bp reads)

• Illumina/Solexa, AB/SOLiD short-read sequencers (1-4 Gb in 25-50 bp reads)

Page 3:

Sequencing chemistries

[Figure: DNA ligation and DNA base extension (Church, 2005)]

Page 4:

Template clonal amplification

Church, 2005

Page 5:

Massively parallel sequencing

Church, 2005

Page 6:

Features of NGS data

• Short sequence reads
– 100-200 bp
– 25-35 bp (micro-reads)

• Huge amount of sequence per run
– Up to gigabases per run

• Huge number of reads per run
– Up to hundreds of millions

• Higher error rate than Sanger sequencing
– Error profile different from Sanger

Page 7:

Current and future application areas

Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery

[Figure: reads aligned to a reference genome, showing a SNP and a deletion (DEL)]

De novo genome sequencing

Short-read sequencing will be (at least) an alternative to micro-arrays for:

• DNA-protein interaction analysis (ChIP-Seq)
• novel transcript discovery
• quantification of gene expression
• epigenetic analysis (methylation profiling)

Page 8:

Fundamental informatics challenges

1. Interpreting machine readouts – base calling, base error estimation

2. Dealing with non-uniqueness in the genome: resequenceability

3. Alignment of billions of reads

Page 9:

Informatics challenges (cont’d)

4. SNP and short INDEL, and structural variation discovery

5. Data visualization

6. Data storage & management

Page 10:

Challenge 1. Base accuracy and base calling

• machine read-outs are quite different

• read length, read accuracy, and sequencing error profiles are variable (and change rapidly as machine hardware, chemistry, optics, and noise filtering improve)

• what is the instrument-specific error profile?

• are the base quality values satisfactory?
(1) are base quality values accurate?
(2) are most called bases high-quality?

Page 11:

454 pyrosequencer error profile

• multiple bases in a homo-polymeric run are incorporated in a single incorporation test → the number of bases must be determined from a single scalar signal → the majority of errors are INDELs

• error rates are nucleotide-dependent

Page 12:

454 base quality values

• the native 454 base caller assigns base quality values that are too low

Page 13:

PYROBAYES: determine base number

New 454 base caller:

[Figure: the posterior base number probability is computed from the data likelihoods and the priors]

Page 14:

PYROBAYES: base calls and quality values

• call the most likely number of nucleotides

• produce three base quality values: QS (substitution), QI (insertion), QD (deletion)
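
The Bayesian idea on these two slides can be sketched in a few lines. This is not the PyroBayes implementation: the Gaussian signal model, its noise parameters, the geometric prior, and the function names are all assumptions chosen for illustration.

```python
import math

def posterior_base_number(signal, max_n=8, sigma0=0.15, sigma_slope=0.05, prior_decay=0.25):
    """Posterior P(n | signal) for homopolymer lengths n = 0..max_n.

    Assumed model (illustrative only): the flow signal for n incorporated
    bases is Gaussian with mean n and a standard deviation that grows with n;
    the prior over n decays geometrically.
    """
    weights = []
    for n in range(max_n + 1):
        sigma = sigma0 + sigma_slope * n
        likelihood = math.exp(-0.5 * ((signal - n) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        prior = prior_decay ** n
        weights.append(likelihood * prior)
    total = sum(weights)
    return [w / total for w in weights]

def call_base_number(signal):
    """Call the most likely base number and split the remaining
    probability mass into over-call and under-call error probabilities."""
    post = posterior_base_number(signal)
    n_called = max(range(len(post)), key=post.__getitem__)
    p_overcall = sum(post[:n_called])        # true run is shorter than called
    p_undercall = sum(post[n_called + 1:])   # true run is longer than called
    return n_called, p_overcall, p_undercall
```

Under this toy model, a signal of 2.1 is called as two bases. The over-call and under-call probabilities are loosely analogous to insertion and deletion quality values, although how PyroBayes derives QS, QI, and QD is not stated on the slide.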

Page 15:

PYROBAYES: Performance

• better correlation between assigned and measured quality values

• higher fraction of high-quality bases

Page 16:

Illumina/Solexa base accuracy

• error rate grows as a function of base position within the read

• a large fraction of the reads contains 1 or 2 errors

Page 17:

Illumina/Solexa base accuracy (cont’d)

• Actual base accuracy for a fixed base quality value is a function of base position within the read (i.e. there is need for quality value calibration)

• Most errors are substitutions → PHRED quality values work
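
As a reminder of what calibration means here: a PHRED quality value Q corresponds to an error probability of 10^(-Q/10). The sketch below converts between the two and tabulates the measured quality for each (read position, assigned quality) pair. The function names, the input tuple layout, and the pseudocount are assumptions for illustration, not part of any tool named in the talk.

```python
import math
from collections import defaultdict

def phred_to_error_prob(q):
    """Error probability implied by a PHRED quality value."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """PHRED quality value implied by an error probability."""
    return -10 * math.log10(p)

def measured_quality_by_position(observations, pseudocount=1.0):
    """Tabulate measured quality for each (position, assigned quality) pair.

    observations: iterable of (position, assigned_q, is_error) tuples, where
    is_error is known from alignment to a trusted reference (assumed input).
    A pseudocount keeps the estimate finite when no errors are observed.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [errors, total]
    for position, assigned_q, is_error in observations:
        counts[(position, assigned_q)][0] += int(is_error)
        counts[(position, assigned_q)][1] += 1
    table = {}
    for key, (errors, total) in counts.items():
        rate = (errors + pseudocount) / (total + 2 * pseudocount)
        table[key] = error_prob_to_phred(rate)
    return table
```

A table like this can serve as a lookup that maps an assigned quality at a given read position to a calibrated one.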

Page 18:

AB SOLiD System dibase sequencing

2-base, 4-color: 16 probe combinations

[Figure: three example probes, written 3’ to 5’: N N N T G z z z, N N N G A z z z, N N N A T z z z]

● 4 dyes to encode 16 2-base combinations
● Detecting a single color indicates 4 combinations and eliminates 12
● Each color reflects position, not the base call
● Each base is interrogated by two probes
● Dual interrogation eases discrimination
– errors (random or systematic) vs. SNPs (true polymorphisms)

Color code by 1st base (rows) and 2nd base (columns):

    A C G T
A   0 1 2 3
C   1 0 3 2
G   2 3 0 1
T   3 2 1 0

Page 19:

Converting dibase (color) into base calls

The decoding matrix allows a sequence of transitions to be converted to a base sequence, as long as one of the two bases is known.

Color code by 1st base (rows) and 2nd base (columns):

    A C G T
A   0 1 2 3
C   1 0 3 2
G   2 3 0 1
T   3 2 1 0

Color sequence: 0 1 1 0 2 3 0 2 2

Candidate dibases at each color position, by first base:

A: AA AC AC AA AG AT AA AG AG
C: CC CA CA CC CT CG CC CT CT
G: GG GT GT GG GA GC GG GA GA
T: TT TG TG TT TC TA TT TC TC

4 possible sequences:

A A C A A G C C T C
C C A C C T A A G A
G G T G G A T T C T
T T G T T C G G A G
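
The decoding step can be sketched as follows. Numbering the bases A=0, C=1, G=2, T=3 and taking the XOR of two adjacent bases reproduces the color matrix; that numbering is an implementation convenience assumed here, and the function names are hypothetical.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}

def color_of(first, second):
    """Color (0-3) of a dibase; XOR of the 2-bit base codes."""
    return CODE[first] ^ CODE[second]

def encode(bases):
    """Translate a base sequence into its color sequence."""
    return [color_of(a, b) for a, b in zip(bases, bases[1:])]

def decode(first_base, colors):
    """Translate a color sequence into bases, given the known first base."""
    seq = [first_base]
    for color in colors:
        # XOR is its own inverse, so the next base follows from the
        # previous base and the observed color.
        seq.append(BASES[CODE[seq[-1]] ^ color])
    return "".join(seq)

colors = [0, 1, 1, 0, 2, 3, 0, 2, 2]
candidates = {base: decode(base, colors) for base in BASES}
```

Each choice of first base yields one of the four possible sequences listed on the slide, which is why one known base is needed to pick the right one.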

Page 20:

Alignment to reference in “color-space”

[Figure: reads aligned to the reference in color space]

• Working in color space:
– Reverse-complementation becomes simply reverse
– Apply color transition rules to remove measurement errors from partial assemblies
– If reference or Sanger reads are combined, translate them to color space
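
The first point can be checked directly. The sketch below encodes a sequence and its reverse complement in color space and compares the two; the XOR encoding over A=0, C=1, G=2, T=3 and the function names are assumptions made for illustration.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def to_colors(bases):
    """Color-space encoding: XOR of the 2-bit codes of adjacent bases."""
    return [CODE[a] ^ CODE[b] for a, b in zip(bases, bases[1:])]

def reverse_complement(bases):
    """Reverse complement in base space."""
    return bases.translate(COMPLEMENT)[::-1]

def revcomp_is_plain_reverse(bases):
    """True if the colors of the reverse complement equal the reversed colors."""
    return to_colors(reverse_complement(bases)) == to_colors(bases)[::-1]
```

The property holds because complementing both bases of a dibase leaves its color unchanged, so only the order of the colors flips.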

Page 21:

SOLiD error checking code (I)

[Figure: the sequences below shown with their color calls. Cases shown: No change; SNP; Measurement error]

A C G G T C G T C G T G T G C G T
A C G G T C G T C G T G T G C G T
A C G G T C G C C G T G T G C G T
A C G G T C G T C G T G T G C G T

Page 22:

SOLiD error checking code (II)

[Figure: dibase encodings of the reference triplet G T C and of its three possible middle-base changes, G A C, G C C, and G G C]

• 3 possible changes in the middle base give 3 possible changes in the dibase encoding

• Allowed transitions: only some transitions indicate a SNP in the sample
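
A minimal sketch of the rule behind the allowed transitions: an isolated SNP changes two adjacent colors while leaving their XOR unchanged (the flanking bases are the same), whereas a single changed color cannot come from a base substitution. The XOR encoding over A=0, C=1, G=2, T=3, the labels, and the function names are assumptions for illustration; adjacent SNPs and INDELs are deliberately not handled.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}

def to_colors(bases):
    """Color-space encoding: XOR of the 2-bit codes of adjacent bases."""
    return [CODE[a] ^ CODE[b] for a, b in zip(bases, bases[1:])]

def classify_color_changes(ref_colors, read_colors):
    """Label color mismatches between two equal-length color sequences.

    Returns a list of (color_index, label) pairs. Simplified: mismatches are
    examined left to right, pairing each one with an immediate neighbor.
    """
    mismatches = [i for i, (r, c) in enumerate(zip(ref_colors, read_colors)) if r != c]
    calls = []
    i = 0
    while i < len(mismatches):
        p = mismatches[i]
        if i + 1 < len(mismatches) and mismatches[i + 1] == p + 1:
            same_xor = (ref_colors[p] ^ ref_colors[p + 1]) == (read_colors[p] ^ read_colors[p + 1])
            calls.append((p, "SNP" if same_xor else "invalid adjacent"))
            i += 2
        else:
            calls.append((p, "measurement error"))
            i += 1
    return calls
```

For the triplet on this slide, `to_colors("GTC")` gives `[1, 2]`, and the three middle-base changes G A C, G C C, and G G C give `[2, 1]`, `[3, 0]`, and `[0, 3]`, all with the same XOR of 3.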

Page 23:

SOLiD error checking code (III)

[Figure: the sequences below shown with their color calls. Cases shown: Invalid adjacent; 1 base deletion; 2 base deletion]

A C G G T C G T C G T G T G C G T
A C G G T C - T C G T G T G C G T
A C G G T C - - C G T G T G C G T
A C G G T C G T C G T G T G C G T

Page 24:

SOLiD di-base sequencing accuracy and QV

[Figure: measured QV (0 to 40) and error rate (0% to 10%) plotted against position on read (0 to 40)]

Page 25:

Challenge 2. Resequenceability

• Reads from repeats cannot be uniquely mapped back to their true region of origin

• RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length

• Near-perfect micro-repeats can also be a problem because we want to align reads even with a few sequencing errors and / or SNPs

[Figure: percentage of reads (0% to 100%) by number of mismatches (0 to 8)]

Page 26:

Repeats at the fragment level

“base masking”

“fragment masking”

Page 27:

Fragment level repeat annotation

[Figure: fraction of genome (0.00 to 0.40) by number of mismatches allowed (0, 1, 2)]

• bases in repetitive fragments may be resequenced with reads representing other, unique fragments → fragment-level repeat annotations spare a higher fraction of the genome than base-level repeat masking

Page 28:

Find perfect and near-perfect micro-repeats

• Hash based methods (fast but only work out to a couple of mismatches)
• Exact methods (very slow but find every repeat copy)
• Heuristic methods (fast but miss a fraction of the repeats)
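
As a rough illustration of the hash-based approach, the sketch below finds perfect micro-repeats only: it hashes every read-length fragment and reports those occurring more than once. The function names are hypothetical, the reverse strand is ignored, and mismatch tolerance (the hard part noted above) is not attempted.

```python
from collections import defaultdict

def exact_micro_repeats(genome, k):
    """Map each k-long fragment that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)
    return {kmer: starts for kmer, starts in positions.items() if len(starts) > 1}

def fraction_in_repeats(genome, k):
    """Fraction of bases covered by at least one repeated k-long fragment."""
    covered = [False] * len(genome)
    for starts in exact_micro_repeats(genome, k).values():
        for start in starts:
            for j in range(start, start + k):
                covered[j] = True
    return sum(covered) / len(genome)
```

Setting k to the read length gives the fragments that a read of that length could not be placed uniquely, even without sequencing errors.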

Page 29:

Challenge 3. Read alignment and assembly

• resequencing requires reference sequence-guided read alignment

• to align billions of reads the aligner has to be fast and efficient
• INDEL errors require gapped alignment
• individually aligned reads must be “assembled” together
• has to work for every read type (short, medium-length, and long reads)
• must tolerate sequencing errors and SNPs
• must work with both base-level and fragment-level repeat annotations

• transcribed sequences require additional features, e.g. splice-site aware alignment capability

• most frequently used tools: BLAT (only pair-wise), SSAHA (pair-wise), MAQ (pair-wise and assembly), ELAND (pair-wise), MOSAIK (pair-wise and assembly, gapped)

Page 30:

MOSAIK: method

Step 1. initial short-hash based scan for possible read locations

Step 2. evaluation of candidate read locations with SW method
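
The two steps can be illustrated with a toy seed-and-extend aligner. This is a sketch of the general technique, not MOSAIK's code: the hash size, scoring values, padding, and all function names are assumptions.

```python
from collections import defaultdict

def build_hash(reference, k):
    """Index every k-mer of the reference by its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_starts(read, index, k):
    """Step 1: reference start positions implied by exact k-mer hits."""
    starts = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            starts.add(hit - offset)
    return starts

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def align_read(read, reference, index, k, pad=3):
    """Step 2: score each candidate location with Smith-Waterman, keep the best.

    Returns (score, start) or None if no k-mer of the read hits the reference.
    """
    best = None
    for start in candidate_starts(read, index, k):
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        score = smith_waterman_score(read, reference[lo:hi])
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

The hash scan keeps the expensive dynamic programming confined to a few short windows, and the Smith-Waterman step supplies the gapped alignment that INDEL errors require.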

Page 31:

MOSAIK – performance

• Solexa read alignments to C. elegans genome:
– 100 million reads aligned in 95 minutes
– 18,000 reads / second

• 454 reads to Pichia (yeast-size) genome:
– GS20: 2,000 reads / second
– FLX: 300 reads / second

• Solexa read alignments to masked human genome:
– 40 seconds for 1 million reads
– 18,000 reads / second
– 5.5 GB RAM used (more for longer initial hash sizes)

Page 32:

MOSAIK: co-assembling different read types

ABI/cap.

454/FLX

Illumina

454/GS20

Page 33:

Challenge 4. Polymorphism discovery

• shallow and deep read coverage

• most candidates will never be “checked” → only very low error rates are acceptable

• we updated PolyBayes to deal with new read types

• made the new software (PBSHORT) much more efficient
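
To illustrate why base qualities and read depth both matter for a low error rate, here is a toy Bayesian calculation of the probability that a site differs from the reference. It is not the PolyBayes or PBSHORT model: the haploid assumption, the prior, the uniform spread of errors over the other three bases, and the function name are all assumptions.

```python
def snp_posterior(ref_base, observations, prior_snp=0.001):
    """Posterior probability that the true base differs from ref_base.

    observations: list of (called_base, phred_quality) for reads covering the
    site. Assumes a haploid (or fully inbred) sample and independent errors.
    """
    posterior = {}
    for allele in "ACGT":
        prior = (1 - prior_snp) if allele == ref_base else prior_snp / 3
        likelihood = 1.0
        for base, quality in observations:
            p_error = 10 ** (-quality / 10)
            likelihood *= (1 - p_error) if base == allele else p_error / 3
        posterior[allele] = prior * likelihood
    total = sum(posterior.values())
    return 1.0 - posterior[ref_base] / total
```

Under these assumptions a single Q20 mismatch is weak evidence, while one Q40 mismatch or two agreeing Q20 mismatches push the posterior above 0.9, which is one way to see why single-read SNP calls depend heavily on accurate quality values.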

Page 34:

Structural variation discovery

• copy number variations (deletions & amplifications) can be detected from variations in the depth of read coverage

• structural rearrangements (inversions and translocations) require paired-end read data
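
The depth-of-coverage idea can be sketched by counting read starts in fixed windows and comparing each window with the median. The window size, the 0.5 and 1.5 ratio thresholds, and the function name are illustrative assumptions; a practical method would also need to correct for repeats and coverage biases.

```python
from statistics import median

def coverage_ratio_calls(read_starts, genome_length, window, low=0.5, high=1.5):
    """Label each window by its read count relative to the median window.

    Returns a list of (window_index, ratio, label) tuples.
    """
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for start in read_starts:
        counts[start // window] += 1
    typical = median(counts)
    if typical == 0:
        raise ValueError("coverage too low to normalize")
    calls = []
    for w, count in enumerate(counts):
        ratio = count / typical
        if ratio < low:
            label = "deletion candidate"
        elif ratio > high:
            label = "amplification candidate"
        else:
            label = "normal"
        calls.append((w, ratio, label))
    return calls
```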

Page 35:

Challenge 5. Data visualization

1. aid software development: integration of trace data viewing, fast navigation, zooming/panning

2. facilitate data validation (e.g. SNP validation): co-viewing of multiple read types, quality value displays

3. promote hypothesis generation: integration of annotation tracks

Page 36:

Challenge 6. Massive data volumes

• two connected working groups to define standard data formats

Short-read format working group: [email protected] (Asim Siddiqui, UBC)

Assembly format working group: Boston College, http://assembly.bc.edu

Page 37:

Next-generation sequencing software

Machine manufacturers’ sites plus third-party developers’ sites, e.g.:

http://bioinformatics.bc.edu/marthlab/Mosaik

http://sourceforge.net/projects/maq/

Page 38:

Applications in various discovery projects

1. SNP discovery in shallow, single-read 454 coverage (Drosophila melanogaster)

2. Mutational profiling in deep 454 data (Pichia stipitis)

3. SNP and INDEL discovery in deep Illumina / Solexa short-read coverage (Caenorhabditis elegans)

(image from Nature Biotech.)

Page 39:

SNP calling in single-read 454 coverage

• collaborative project with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)
• goal was to assess polymorphism rates between 10 different African and American melanogaster isolates
• 10 runs of 454 reads (~300,000 reads per isolate) were collected
• key informatics question: can we detect SNPs with high accuracy in low-coverage, survey-style 454 reads aligned to finished reference genome sequence?

DNA courtesy of Chuck Langley, UC Davis

• reads were base-called with PyroBayes and aligned to the 180 Mb reference melanogaster genome sequence with Mosaik → 0.16x nominal read coverage → most reads are singletons
• SNP detection with PolyBayes
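
The step from 0.16x nominal coverage to "most reads are singletons" follows from a simple Poisson model of read placement. The sketch below assumes reads land uniformly at random; the mean read length of about 96 bp used in the test is an assumption chosen to be consistent with the figures on the slide, not a number the slide states, and the function names are hypothetical.

```python
import math

def nominal_coverage(n_reads, mean_read_length, genome_size):
    """Total sequenced bases divided by genome size."""
    return n_reads * mean_read_length / genome_size

def depth_probability(coverage, depth):
    """Poisson probability that a base is covered by exactly `depth` reads."""
    return math.exp(-coverage) * coverage ** depth / math.factorial(depth)

def single_coverage_fraction(coverage):
    """Among bases covered at least once, the fraction covered exactly once."""
    covered = 1 - depth_probability(coverage, 0)
    return depth_probability(coverage, 1) / covered
```

At 0.16x, only about 15% of bases are covered at all under this model, and roughly 92% of those are covered by a single read.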

Page 40:

SNP calling success rates

[Figure: alignment of the iso-1 reference, a 46-2 454 read, and 46-2 ABI reads (2 fwd + 2 rev)]

• 92.9% validation rate (1,342 / 1,443)
– single-read coverage: 92.9% (1,275 / 1,372)
– double-read coverage: 94.3% (67 / 71)

• 2.0% missed SNP rate (25 / 1,247)
– single-read coverage: 2.12% (25 / 1,176)
– double-read coverage: 0% (0 / 59)

Page 41:

Genome variation in melanogaster isolates

• 658,280 SNPs discovered among all 10 lines.

• Nucleotide diversity θ ≈ 5 × 10⁻³ (1 SNP / 200 bp) between each line and reference (in line with expectations).

• 20.2% (133,264 sites) polymorphic among two or more lines. The 1 SNP / 900 bp nominal density is sufficient for high-resolution marker mapping.

Page 42:

SNP calling in short-read coverage

[Figure: reads from Pasadena, CB4858 (1 ½ machine runs) and Bristol, N2 strain (3 ½ machine runs) aligned to the C. elegans reference genome (Bristol, N2 strain)]

• goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes

• 5 runs (~120 million Illumina reads) from the Wash. U. Genome Center, as part of a collaborative project led by Elaine Mardis at Washington University

• primary aim was to detect polymorphisms between the Pasadena and the Bristol strain

Page 43:

Polymorphism discovery in C. elegans

• SNP calling error rate very low:
– Validation rate = 97.8% (224/229)
– Conversion rate = 92.6% (224/242)
– Missed SNP rate = 3.75% (26/693)

[Figure: example alignments showing a SNP and an insertion (INS)]

• INDEL candidates validate and convert at similar rates to SNPs:
– Validation rate = 89.3% (193/216)
– Conversion rate = 87.3% (193/221)

• MOSAIK aligned / assembled the reads (< 4 hours on 1 CPU)
• PBSHORT found 44,642 SNP candidates (2 hours on our 160-CPU cluster)
• SNP density: 1 in 1,630 bp (of non-repeat genome sequence)

Page 44:

Mutational profiling: deep 454/Illumina/SOLiD data

• collaboration with Doug Smith at Agencourt

• Pichia stipitis converts xylose to ethanol (bio-fuel production)

• one mutagenized strain had especially high conversion efficiency

• determine where the mutations were that caused this phenotype

• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads

• 14 true point mutations in the entire genome

[Figure: Pichia stipitis reference sequence (image from JGI web site)]

Page 45:

Mutational profiling: comparisons

Technology   Coverage   Nominal coverage   FP   FN   Total error
454/FLX      2 runs     12.9x              1    0    1
454/FLX      1 run      9.8x               6    1    7
Illumina     7 lanes    53.5x              0    0    0
Illumina     3 lanes    23.4x              0    0    0
Illumina     2 lanes    15.6x              2    0    2
Illumina     1 lane     7.6x               2    2    2
SOLiD        -          30.0x              0    0    0
SOLiD        -          20.0x              0    0    0
SOLiD        -          10.0x              0    0    0
SOLiD        -          8.0x               0    4    4
SOLiD        -          6.0x               0    6    6

Page 46:

Informatics of transcriptome sequencing

• measuring gene expression levels by sequence tag counting requires SAGE informatics-like approaches

[Figure: counts per transcript based on SAGE data of C. elegans adult worm (Jones et al. 2001, GEO 24438); frequency (0 to 12,000) by count of SAGE tag (1, 2-25, 25-50, 50-75, 75-100, 101-200, 201-300, 301-400, 401-500, >500)]

• novel transcript discovery

[Figure: inferred exon 1 and inferred exon 2, illustrating new genes & exons and novel transcripts in known genes]

Page 47:

Protein-DNA interactions: ChIP-Seq

Protein-bound DNA fragments are isolated with chromatin immunoprecipitation (ChIP) and then sequenced (Seq) on a high-throughput sequencing platform. Sequences are mapped to the genome sequence with a read alignment program. Regions over-represented in the sequences are identified.

Johnson et al. Science, 2007

Page 48:

Protein-DNA interactions: ChIP-Seq

ChIP-Seq scales well for simultaneous analysis of binding sites in the entire genome.

Mikkelsen et al. Nature 2007.