SHRiMP: The SHort Read Mapping Package Michael Brudno Department of Computer Science University of Toronto 11/09/08


SHRiMP: The SHort Read Mapping Package

Michael Brudno

Department of Computer Science

University of Toronto

11/09/08


Handling NGS Data

• NGS: at least 3 distinct read types:
  – Illumina/Solexa, 454: letter-space
  – AB SOLiD: color-space (di-base sequencing)
  – 2-pass SMS (Helicos): 2 reads of the same location, higher error rates

• Need new algorithms
  – SOLiD: Biologists want letters, not colors
  – 2-pass: How to best handle two reads?


SHRiMP Overview

Isolate similarity in stages:

1. Spaced Seed Filtering

2. Vectorized Smith-Waterman

3. Full Alignment
   – Specialized for SOLiD, 2-pass, Letter-space

4. Compute p-values (and other statistics)

(Slide annotation: a brace labeled "Common" marks the stages shared by all read types.)
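Stage 1 can be illustrated with the weight-8 spaced seed 1111001111 quoted on the comparison slide. The sketch below is only an illustration of spaced-seed filtering, not SHRiMP's implementation: the function names, the toy genome, and the choice to index the read and scan the genome are all assumptions made for the example.

```python
from collections import defaultdict

SEED = "1111001111"  # weight-8 spaced seed quoted elsewhere in the talk


def seed_key(window: str, seed: str = SEED) -> str:
    """Keep only the letters that sit under the seed's '1' positions."""
    return "".join(ch for ch, bit in zip(window, seed) if bit == "1")


def index_read(read: str, seed: str = SEED) -> dict:
    """Map every spaced k-mer key of the read to its start offsets in the read."""
    index = defaultdict(list)
    for i in range(len(read) - len(seed) + 1):
        index[seed_key(read[i:i + len(seed)], seed)].append(i)
    return index


def candidate_diagonals(read: str, genome: str, seed: str = SEED) -> set:
    """Scan the genome once; report genome_pos - read_pos for every seed hit."""
    index = index_read(read, seed)
    hits = set()
    for g in range(len(genome) - len(seed) + 1):
        key = seed_key(genome[g:g + len(seed)], seed)
        for r in index.get(key, ()):
            hits.add(g - r)
    return hits


# Hypothetical toy data: the read copies genome[7:21] with one substitution.
genome = "ACGTTGACCTGAGCGTTCAAGGTCA"
read = "CCTGAGCATTCAAG"
print(candidate_diagonals(read, genome))
```

The substitution falls under the seed's two "don't care" positions for some windows, so the read is still found at offset 7; only those candidate locations would be passed on to the Smith-Waterman stages.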


Outline

1. AB SOLiD Reads

2. 2-pass (SMS) Reads


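AB SOLiD: Dibase Sequencing

Each color encodes a pair of adjacent bases:

     A  C  G  T
  A  0  1  2  3
  C  1  0  3  2
  G  2  3  0  1
  T  3  2  1  0

(Diagram: the four bases A, C, G, T drawn as states, with the transitions between them labeled by colors 0-3.)

AB SOLiD reads look like this:

  T012233102

  T012233102 decodes to TGAGCGTTC
  T012033102 (one color changed) decodes to TGAATACCT

  TGAGCGTTC
  |||
  TGAATACCT

HMM!!! hmm???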
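The color table above is equivalent to XOR-ing 2-bit base codes (A=0, C=1, G=2, T=3). A minimal sketch of encoding and decoding under that observation; the function names are illustrative and not part of SHRiMP.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; color of a base pair = XOR of the two codes


def encode(primer: str, letters: str) -> str:
    """Turn a letter-space sequence into a color-space read starting at `primer`."""
    prev, out = primer, [primer]
    for base in letters:
        out.append(str(BASES.index(prev) ^ BASES.index(base)))
        prev = base
    return "".join(out)


def decode(read: str) -> str:
    """Turn a color-space read (primer base + colors) back into letters."""
    prev, out = read[0], []
    for color in read[1:]:
        prev = BASES[BASES.index(prev) ^ int(color)]
        out.append(prev)
    return "".join(out)


print(decode("T012233102"))        # TGAGCGTTC
print(encode("T", "TGAGCGTTC"))    # T012233102
print(decode("T012033102"))        # same read with the fourth color changed
```

Because each letter depends on every color before it, a single wrong color changes all the letters that follow, which is why decoding first and aligning afterwards does not work well.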


AB SOLiD: Color space is complex!

  G: TTGAGTTATGGAT   012210331023
  R: TTGACTTATGGAT   012120331023

SNPs

  TGAGTT   12210
  TGACTT   12120
  TGAATT   12030
  TGATTT   12300

INDELS

  TGAGTTA    122103
  TGA-TTA    12-303

  TGAGTTTA   1221003
  TGAGTATA   1221333

It's bloody complicated!
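The SNP examples can be checked mechanically: an isolated letter change alters the two adjacent colors that cover it. A small sketch using the same XOR view of the color table; the helper names are illustrative.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3


def colors(letters: str) -> str:
    """Color string for a letter sequence (no primer base)."""
    return "".join(
        str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(letters, letters[1:])
    )


def changed_positions(a: str, b: str) -> list:
    """Indices at which two equal-length color strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


reference = "TGAGTT"
for variant in ("TGACTT", "TGAATT", "TGATTT"):
    diff = changed_positions(colors(reference), colors(variant))
    print(variant, colors(variant), diff)
```

Each variant differs from 12210 at color positions 2 and 3, so a real SNP shows up as two adjacent color changes.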


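AB SOLiD: Translations

• Look at: 012233102
• Recall: 012033102
• 4 translations for every color sequence

  start   0 1 2 0 3 3 1 0 2
    A     A C T T A T G G A
    C     C A G G C G T T C
    G     G T C C G C A A G
    T     T G A A T A C C T

(Diagram: the four bases A, C, G, T drawn as states, with the transitions between them labeled by colors 0-3.)

Single translation (start T):

  TGAGCGTTC
  |||
  TGAATACCT

Switching translations at the error:

  TGAGCGTTC
  |||||||||
  TGAGCGTTC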
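The four translations in the table can be generated by decoding the same color string from each possible starting base. A minimal sketch; `translate` is an illustrative name, not a SHRiMP function.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; color = XOR of adjacent base codes


def translate(colors: str, start: str) -> str:
    """Decode a color string into letters, assuming the given starting base."""
    prev, out = start, []
    for c in colors:
        prev = BASES[BASES.index(prev) ^ int(c)]
        out.append(prev)
    return "".join(out)


for start in BASES:
    print(start, translate("012033102", start))
```

The translation that starts from T matches the intended sequence TGAGCGTTC for its first three letters, and the translation that starts from C matches it for the last six (GCGTTC), which is what makes hopping between translations useful.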


AB SOLiD: Modified Smith-Waterman

• 4 S-W matrices, one per translation
• Errors transition into another matrix
• ‘Crossover’ penalty charged for errors

(Figure: two of the S-W matrices, labeled "Translation A" and "Translation C", each aligned against the genome.)
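The idea of hopping between translations can be sketched in a deliberately simplified form: no gaps, one score per translation per position, and a fixed penalty for switching translation. This is not SHRiMP's algorithm (which runs full Smith-Waterman in each matrix); the scores `match=1`, `mismatch=-1`, `crossover=-2` and all names are assumptions chosen for the example.

```python
BASES = "ACGT"
NEG_INF = float("-inf")


def translate(colors, start):
    prev, out = start, []
    for c in colors:
        prev = BASES[BASES.index(prev) ^ int(c)]
        out.append(prev)
    return "".join(out)


def align_ungapped(colors, primer, ref, match=1, mismatch=-1, crossover=-2):
    """Best ungapped score when the path may hop between the four translations."""
    frames = {b: translate(colors, b) for b in BASES}
    # Only the translation that starts from the primer base is free to enter.
    score = {b: (0 if b == primer else NEG_INF) for b in BASES}
    back = []
    for j, ref_base in enumerate(ref):
        new_score, pointer = {}, {}
        for f in BASES:
            # Either stay in translation f or pay the crossover penalty to enter it.
            arrive = {g: score[g] + (0 if g == f else crossover) for g in BASES}
            prev = max(BASES, key=arrive.get)
            step = match if frames[f][j] == ref_base else mismatch
            new_score[f] = arrive[prev] + step
            pointer[f] = prev
        score = new_score
        back.append(pointer)
    # Trace back which translation each aligned letter came from.
    f = max(BASES, key=score.get)
    best, path = score[f], []
    for pointer in reversed(back):
        path.append(f)
        f = pointer[f]
    return best, "".join(reversed(path))


print(align_ungapped("012233102", "T", "TGAGCGTTC"))  # (9, 'TTTTTTTTT')
print(align_ungapped("012033102", "T", "TGAGCGTTC"))  # (7, 'TTTCCCCCC')
```

For the read with one bad color, the best path uses translation T for three letters, pays one crossover, and finishes in translation C, recovering all nine letter matches.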


AB SOLiD: Obligatory Comparison

• SHRiMP and AB Mapper (1.6)
  – SHRiMP seed weight 8 (1111001111)
  – AB 35_2, 35_3 schemas

• 10,000 35bp reads
  – C. savignyi (173Mb), very high polymorphism

• Considering single top hits only

             SHRiMP   AB 35_2   AB 35_3
  % mapped   19.83    6.67      10.94
  Runtime    13m04    1h24      2h25


AB SOLiD: Resultant Alignments

• SHRiMP emits letter-space alignments

– Clear to biologists

– Color-space need not be scary!

G: 798  GAACCCCTTACAACTGAACCCCTTAC  823
        ||X||||||||||||||||||| |||
T:      GAaCCCCTTACAACTGAACCCC-TAC
R:   1 T1211000203110121201000-231   25


Outline

1. AB SOLiD Reads

2. 2-pass (SMS) Reads


2-pass SMS Reads

• SMS reads have high error rates

– “Dark bases” (skipped letters)

– Multiple passes are possible

– Ameliorate errors over passes
  • Good chance of missing base in one read
  • Acceptable chance of getting it in at least one
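A rough way to quantify the last two bullets, assuming (the slide does not say this) that the two passes miss a given base independently and with the same per-base deletion probability p:

```latex
P(\text{base missed in at least one pass}) = 1 - (1 - p)^2,
\qquad
P(\text{base seen in at least one pass}) = 1 - p^2
% Example with p = 0.07, the deletion rate used for the synthetic reads
% later in the talk: 1 - (0.93)^2 = 0.1351 and 1 - (0.07)^2 = 0.9951
```

Under that assumption a base is missing from one of the two reads about 13.5% of the time, but missing from both only about 0.5% of the time.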


Mapping 2-pass Reads

(Figure: an "Original" sequence and its "Reads", transcribed as C-GACTTTACTGACTTA and CTGA-T---, with a "?" marking where they map on the Reference Genome.)


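SMS 2-pass: SHRiMP with 2 reads

(DP matrix: read CTGACT across the columns, read CAGCAT down the rows)

Match = +4   Mismatch = -3   Gap = -2

S=9

  CTG-ACT
  CAGCA-T

  CTGCACT

The sequence under the alignment merges the two gapped reads.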
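The score S=9 can be checked with a plain global (Needleman-Wunsch style) alignment of the two reads under the slide's scores. This is a textbook sketch with a linear gap penalty, not SHRiMP's vectorized code; the function name is illustrative.

```python
def global_alignment_matrix(x, y, match=4, mismatch=-3, gap=-2):
    """dp[i][j] = best score aligning y[:i] (rows) against x[:j] (columns)."""
    rows, cols = len(y) + 1, len(x) + 1
    dp = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        dp[i][0] = i * gap
        for j in range(1, cols):
            sub = match if y[i - 1] == x[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + sub,  # align the two letters
                dp[i - 1][j] + gap,      # letter of y against a gap
                dp[i][j - 1] + gap,      # letter of x against a gap
            )
    return dp


dp = global_alignment_matrix("CTGACT", "CAGCAT")
for row in dp:
    print(" ".join(f"{v:4d}" for v in row))
print("S =", dp[-1][-1])  # S = 9
```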


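SMS 2-pass: SHRiMP with 2 reads

(DP matrix: read CTGACT across the columns, read CAGCAT down the rows)

Match = +4   Mismatch = -3   Gap = -2

S=9

  CTG-ACT    CTGAC-T
  CAGCA-T    CAG-CAT

  CTGCACT    CTGACAT

The sequence under each alignment merges the two gapped reads.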


SMS 2-pass: SHRiMP with 2 reads

(DP matrix: read CTGACT across the columns, read CAGCAT down the rows)

Match = +4   Mismatch = -3   Gap = -2

S=9

  CTG-ACT    CTGAC-T
  CAGCA-T    CAG-CAT

  CTGCACT    CTGACAT

S=8

  C-TG-ACT    CT-GAC-T    C-TGAC-T
  CA-GCA-T    C-AG-CAT    CA-G-CAT

  CATGCACT    CTAGACAT

The sequence under an alignment merges its two gapped reads.
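Both the scores and the merged sequences on this slide can be reproduced from the gapped alignments themselves. In the sketch below, `score` applies the slide's scoring column by column and `merge` fills each gap with the other read's letter; keeping the top read's letter at a mismatch is an assumption that happens to match the sequences shown.

```python
def score(top, bottom, match=4, mismatch=-3, gap=-2):
    """Score a gapped pairwise alignment column by column."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total


def merge(top, bottom):
    """Fill gaps from the other read; at a mismatch keep the top read's letter."""
    return "".join(b if a == "-" else a for a, b in zip(top, bottom))


alignments = [
    ("CTG-ACT", "CAGCA-T"),
    ("CTGAC-T", "CAG-CAT"),
    ("C-TG-ACT", "CA-GCA-T"),
    ("CT-GAC-T", "C-AG-CAT"),
]
for top, bottom in alignments:
    print(top, bottom, score(top, bottom), merge(top, bottom))
```

Each merged sequence is a different guess at the original molecule, which is why several near-optimal alignments are kept rather than only the best one.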


SMS 2-pass: Near-optimal Alignments

Match = +4   Mismatch = -3   Gap = -2

• Compute a DP matrix
• Sum it up with the DP matrix computed in reverse

DP matrix (columns: C T G A C T; rows: C A G C A T):

             C    T    G    A    C    T
        0   -2   -4   -6   -8  -10  -12
  C    -2    4    2    0   -2   -4   -6
  A    -4    2    1   -1    4    2    0
  G    -6    0   -1    5    3    1   -1
  C    -8   -2   -3    3    2    7    5
  A   -10   -4   -5    1    7    5    4
  T   -12   -6    0   -1    5    4    9

+

DP matrix computed in reverse:

        9    3    5    6    0   -6  -12
        3    5    6    8    2   -4  -10
        4    6    8    2    4   -2   -8
        2    1    3    4    6    0   -6
       -4   -2    0    6    1    2   -4
       -6   -4   -2    0    2    4   -2
      -12  -10   -8   -6   -4   -2    0


SMS 2-pass: Near-optimal Alignments

Match = +4   Mismatch = -3   Gap = -2

• Compute a DP matrix
• Sum it up with the DP matrix computed in reverse
• Leave only near optimal alignments

Sum of the two matrices (columns: C T G A C T; rows: C A G C A T):

        9    1    1    0   -8  -16  -24
        1    9    8    7    0   -8  -16
        0    8    9    1    7    0   -8
       -4    0    1    9    9    1   -7
      -12   -4   -3    9    3    9    1
      -16   -8   -7    1    9    9    2
      -24  -16   -8   -7    1    2    9

=

Near-optimal cells that remain:

        9    .    .    .    .    .    .
        .    9    8    .    .    .    .
        .    8    9    .    .    .    .
        .    .    .    9    9    .    .
        .    .    .    9    .    9    .
        .    .    .    .    9    9    .
        .    .    .    .    .    .    9

Represent the remaining cells as a directed graph (Schwikowski & Vingron, 2003)

(Figure: directed graph whose nodes are aligned letter pairs: AT, -T, A-, CC, A-, -T, GG, CC, A-, -A, AA, -C, C-.)
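The three bullets can be sketched as follows: a forward matrix, a reverse matrix obtained by aligning the reversed reads, and a filter that keeps the cells whose summed score is within a chosen slack of the optimum. The code recomputes everything from the stated scores rather than copying the slide's figures, so individual off-path cells may not match the transcribed matrices; the `slack` parameter and all function names are assumptions for the example.

```python
def forward(x, y, match=4, mismatch=-3, gap=-2):
    """f[i][j] = best global score of y[:i] against x[:j]."""
    f = [[0] * (len(x) + 1) for _ in range(len(y) + 1)]
    for j in range(1, len(x) + 1):
        f[0][j] = j * gap
    for i in range(1, len(y) + 1):
        f[i][0] = i * gap
        for j in range(1, len(x) + 1):
            sub = match if y[i - 1] == x[j - 1] else mismatch
            f[i][j] = max(f[i - 1][j - 1] + sub, f[i - 1][j] + gap, f[i][j - 1] + gap)
    return f


def backward(x, y):
    """b[i][j] = best global score of y[i:] against x[j:], via the reversed reads."""
    rev = forward(x[::-1], y[::-1])
    return [row[::-1] for row in rev[::-1]]


def near_optimal_cells(x, y, slack=1):
    """Cells lying on some alignment scoring within `slack` of the optimum."""
    f, b = forward(x, y), backward(x, y)
    best = f[-1][-1]
    return {
        (i, j)
        for i in range(len(y) + 1)
        for j in range(len(x) + 1)
        if f[i][j] + b[i][j] >= best - slack
    }


print(sorted(near_optimal_cells("CTGACT", "CAGCAT", slack=0)))
print(sorted(near_optimal_cells("CTGACT", "CAGCAT", slack=1)))
```

The sum at a cell is the score of the best full alignment forced through that cell, so thresholding it keeps exactly the cells that some near-optimal alignment visits; those cells and the moves between them form the directed graph.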


SMS 2-pass: SHRiMP with 2-pass data

• Build a DAG representing the (near) optimal alignments of the two reads

• Generate seeds (short paths) from the DAG

• Do k-mer scan; if seeds are encountered, align both reads to the location using vectorized SW

• Do full alignment for top hits

(Figure: the DAG of aligned letter pairs, with nodes AT, -T, A-, CC, A-, -T, GG, CC, A-, -A, AA, -C, C-, TT.)


SMS 2-pass: Results (in brief)

• 10,000 synthetic reads (~25-65 bp)
  – 7% deletion, 1% insertion, 1% substitution rate

• Mapped to Human chromosome 1
  – Spaced seed weight 8: 111101111

  Type               Separate   Profile   WSG
  No hits %          0.13       4.91      4.31
  Multiple %         26.45      9.34      9.13
  Unique correct %   63.00      74.90     75.84
  Runtime            9m         11m       12m


SHRiMP Summary

• Fast mapping of short reads to a genome
  – Handles:
    • color-space (SOLiD) reads
    • 2-pass (SMS) reads
    • insertions and deletions
  – Easy to parallelize

• Computation of p-values & other statistics for hits


SHRiMP TODO List

• Faster Mapping (biggest complaint)

• Matepair data support

• Transcriptome Data

• Suggestions?


Acknowledgements

SHRiMP is brought to you by:

University of Toronto
  – Steve Rumble
  – Vlad Yanovsky
  – Adrian Dalca
  – Marc Fiume

Stanford University
  – Phil Lacroute
  – Arend Sidow

http://compbio.cs.toronto.edu/shrimp