varid: a variation detection framework for color-space and letter-space platforms

VARiD: A Variation Detection Framework for Color-space and Letter-space platforms. By A.V. Dalca, S. M. Rumble, S. Levy, M. Brudno. Presented by Velian Pandeliev.


Post on 19-Mar-2016


TRANSCRIPT

Page 1: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

By A.V. Dalca, S. M. Rumble, S. Levy, M. Brudno

Presented by Velian Pandeliev

Page 2: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

VARiD Overview

• Purpose: Variation Detection (SNP, indel)
• Pitch: First to use both colour-space and letter-space data
• Principle: Hidden Markov Model with Forward-Backward algorithm
• Platform: 454/Roche, Solexa, ABI SOLiD
• Pros: Can work with unconverted sets of both formats simultaneously
• Performance: linear in length of reference, great on mixed format data

Page 3: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

ABI SOLiD Basics

• Reads bases two at a time
• Outputs one of four colours based on transition state machine:
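The state-machine figure is not part of this transcript. As a stand-in, here is a minimal sketch of the standard SOLiD two-base encoding, assuming the usual mapping A=0, C=1, G=2, T=3 with the colour given by the XOR of the two base codes. It agrees with the examples later in the deck (CA emits colour 1, A followed by G emits colour 2). The names `BASE_CODE`, `colour` and `encode` are illustrative only.

```python
# Standard 2-bit base codes; the colour of a dinucleotide is the XOR of its codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def colour(first, second):
    """Colour (0-3) emitted for the adjacent base pair first->second."""
    return BASE_CODE[first] ^ BASE_CODE[second]


def encode(sequence):
    """Translate a letter-space sequence into its list of colours."""
    return [colour(a, b) for a, b in zip(sequence, sequence[1:])]


print(encode("ACGTT"))  # colours for A-C, C-G, G-T, T-T
```

A sequence of n bases yields n - 1 colours, which is why a colour-space read also needs one known starting base before it can be translated back.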

Page 4: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

ABI SOLiD Properties

• Read errors and SNPs present differently.

Reference:

Page 5: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

ABI SOLiD Properties

• Read errors and SNPs present differently.

Reference:

Error:

Page 6: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

ABI SOLiD Properties

• Read errors and SNPs present differently.

Reference:

Error:

SNP:
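The figures for these three cases are missing from the transcript. The sketch below illustrates the difference under the standard XOR colour encoding (an assumption, as are the toy sequences): an isolated SNP changes two adjacent colours, while a single colour read error changes only one.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(sequence):
    """Letter space -> colour space (XOR of adjacent base codes)."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(sequence, sequence[1:])]


def differing_positions(xs, ys):
    """Indices at which two equal-length colour lists disagree."""
    return [i for i, (x, y) in enumerate(zip(xs, ys)) if x != y]


reference = "ACGTACGT"   # toy reference
snp_donor = "ACGAACGT"   # same sequence with a T->A SNP at index 3

ref_colours = encode(reference)
snp_colours = encode(snp_donor)

# A read error: one colour of an otherwise correct read is misreported.
error_colours = list(ref_colours)
error_colours[3] = 0     # the true colour at this position is 3

print(differing_positions(ref_colours, snp_colours))    # two adjacent colours
print(differing_positions(ref_colours, error_colours))  # a single colour
```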

Page 7: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

ABI SOLiD Properties

• A read error propagates through the rest of the sequence on translation to letter-space
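A small sketch of this propagation, assuming the standard XOR colour encoding and a known first base. The sequences and the `decode` helper are illustrative, not taken from the paper.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def decode(first_base, colours):
    """Translate colours back to letters, starting from a known first base."""
    sequence = [first_base]
    for c in colours:
        # Each letter depends on the previous letter and the current colour.
        sequence.append(CODE_BASE[BASE_CODE[sequence[-1]] ^ c])
    return "".join(sequence)


true_colours = [1, 3, 1, 3, 1, 3, 1]   # colour encoding of ACGTACGT
bad_colours = list(true_colours)
bad_colours[3] = 0                      # single colour read error

clean = decode("A", true_colours)
corrupted = decode("A", bad_colours)
mismatches = [i for i, (a, b) in enumerate(zip(clean, corrupted)) if a != b]

print(clean)
print(corrupted)
print(mismatches)  # every letter after the erroneous colour is wrong
```

Because each decoded letter depends on the one before it, one wrong colour shifts every later letter, which is the reason to analyse colour-space reads without translating them first.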

Page 8: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Consequences

• Colour-space encoding is better suited to calling SNPs than letter-space encoding

• In letter-space data, errors do not propagate through to the rest of the read

Wouldn’t it be great to have a SNP calling framework that could use both kinds of data!?

Page 9: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

VARiD

• A Hidden Markov Model for Variation Detection

In general, HMMs have the following elements:
- States (hidden)
- Transitions (probabilities of reaching any particular state from the previous one)
- Emissions (observed outputs)

Page 10: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Building a Basic HMM

States: pairs of consecutive letter-space positions:

S = {AA, AT, AC, AG,
     TT, TA, TC, TG,
     CC, CA, CT, CG,
     GG, GA, GT, GC}

Page 11: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Building a Basic HMM

Transitions: since consecutive states share a nucleotide, probabilities are defined as follows:

\[
P(\text{transition } WX \to YZ) =
\begin{cases}
\text{frequency}(Z) & \text{if } X = Y \\
0 & \text{if } X \neq Y
\end{cases}
\]
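The 16 states and this transition rule can be sketched as follows. The uniform base frequencies in `BASE_FREQ` are a placeholder assumption; the slides do not say which frequencies VARiD uses.

```python
from itertools import product

BASES = "ACGT"
# The 16 states: every ordered pair of consecutive nucleotides.
STATES = ["".join(pair) for pair in product(BASES, repeat=2)]

# Hypothetical base frequencies (uniform); replace with real estimates.
BASE_FREQ = {base: 0.25 for base in BASES}


def transition_prob(prev_state, next_state):
    """P(WX -> YZ): frequency(Z) if X == Y, otherwise 0."""
    if prev_state[1] != next_state[0]:
        return 0.0
    return BASE_FREQ[next_state[1]]


TRANSITIONS = {(a, b): transition_prob(a, b) for a in STATES for b in STATES}

# Each state can only move to the 4 states that start with its second letter.
row_sum = sum(TRANSITIONS[("CA", s)] for s in STATES)
print(len(STATES), row_sum)
```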

Page 12: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

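Building a Basic HMM

Emissions: a letter and a colour from donor reads at each state.

E.g. for colour space:

\[
P(\text{emission} = c \mid \text{state} = CA) = q(c \mid CA) =
\begin{cases}
1 - 3\varepsilon & \text{if } c = 1 \\
\varepsilon & \text{if } c \in \{0, 2, 3\}
\end{cases}
\]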

Page 13: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Building a Basic HMM

Emissions: a letter and a colour from donor reads at each state.

E.g. for letter space:

\[
P(\text{emission} = n \mid \text{state} = CA) = q(n \mid CA) =
\begin{cases}
1 - 3\xi & \text{if } n = A \\
\xi & \text{if } n \in \{C, G, T\}
\end{cases}
\]

Page 14: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Building a Basic HMM

Emission probabilities from all reads:

\[
P(\text{emissions} = E \mid \text{state} = s) = q(E \mid s) = \prod_{c \in E} q(c \mid s) \, \prod_{n \in E} q(n \mid s)
\]

which combines colour and letter space data
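A minimal sketch of these emission probabilities. It assumes the XOR colour encoding and, following the CA example, that a letter-space read at a state reports the state's second nucleotide. The default values of `eps` and `xi` are arbitrary placeholders, not the paper's error rates.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def q_colour(c, state, eps):
    """q(c|s): 1 - 3*eps for the state's true colour, eps for each other colour."""
    true_colour = BASE_CODE[state[0]] ^ BASE_CODE[state[1]]
    return 1 - 3 * eps if c == true_colour else eps


def q_letter(n, state, xi):
    """q(n|s): 1 - 3*xi for the state's second letter, xi for each other letter."""
    return 1 - 3 * xi if n == state[1] else xi


def q_all(colours, letters, state, eps=0.02, xi=0.01):
    """q(E|s): product over every colour and letter observed at this position."""
    p = 1.0
    for c in colours:
        p *= q_colour(c, state, eps)
    for n in letters:
        p *= q_letter(n, state, xi)
    return p


# Three colour reads (one disagreeing) and one letter read covering state CA.
print(q_all([1, 1, 0], ["A"], "CA"))
```

With deep coverage the product becomes very small, so a practical version would sum log-probabilities instead of multiplying.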

Page 15: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Building a Basic HMM

Detecting variation is accomplished through finding the maximum likelihood state for each position in the genotype (the donor) and comparing it against the reference nucleotide.

Page 16: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Building a Basic HMM

Source: Dalca, A. & Brudno, M. (Poster)

By running the Forward-Backward algorithm on the HMM, a probability distribution over the possible states is obtained for each position, and a base is called (in bold).
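The following is a toy Forward-Backward pass over the 16-state model, not VARiD's implementation. The error rates, the uniform start and base frequencies, and the three-position observation set are all assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}
STATES = ["".join(p) for p in product(BASES, repeat=2)]
EPS = 0.05  # hypothetical colour error rate (epsilon)
XI = 0.05   # hypothetical letter error rate (xi)


def emission(colours, letters, state):
    """q(E|s): product over all colour and letter observations."""
    true_colour = CODE[state[0]] ^ CODE[state[1]]
    p = 1.0
    for c in colours:
        p *= (1 - 3 * EPS) if c == true_colour else EPS
    for n in letters:
        p *= (1 - 3 * XI) if n == state[1] else XI
    return p


def transition(prev, nxt):
    """Uniform base frequencies assumed; zero unless the states overlap."""
    return 0.25 if prev[1] == nxt[0] else 0.0


def normalise(values):
    total = sum(values)
    return [v / total for v in values]


def forward_backward(observations):
    """Return, for each position, the posterior over the 16 states."""
    k, n = len(STATES), len(observations)
    emit = [[emission(cs, ls, s) for s in STATES] for cs, ls in observations]
    trans = [[transition(a, b) for b in STATES] for a in STATES]

    fwd = [normalise([e / k for e in emit[0]])]  # uniform start assumed
    for t in range(1, n):
        fwd.append(normalise([
            emit[t][j] * sum(fwd[t - 1][i] * trans[i][j] for i in range(k))
            for j in range(k)
        ]))

    bwd = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = normalise([
            sum(trans[i][j] * emit[t + 1][j] * bwd[t + 1][j] for j in range(k))
            for i in range(k)
        ])

    return [normalise([f * b for f, b in zip(fwd[t], bwd[t])]) for t in range(n)]


# Per position: (colours seen by colour-space reads, letters seen by letter reads).
observations = [
    ([1, 1, 1], ["C"]),
    ([3, 3, 0], ["G", "G"]),   # one colour read disagrees here
    ([1, 1], ["T"]),
]
posteriors = forward_backward(observations)
called = [STATES[max(range(len(STATES)), key=p.__getitem__)] for p in posteriors]
print(called)
```

The forward and backward vectors are rescaled at every step to avoid underflow; this does not change the posteriors because they are normalised at the end. Each position costs a fixed amount of work, which matches the stated linear running time in the length of the reference.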

Page 17: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Extensions

The HMM described above is quite simple and only calls a single nucleotide for each position.

VARiD extends the model to detect heterozygous SNPs, as well as to handle indels.

Page 18: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Microindels

To deal with microindels (<5 bp) in the sample, gap states are required.

E.g. [A - - - G] (would emit colour 2)
- 4 dummy ‘gap’ nucleotides are defined, one each for A, C, G, T
- [A - - - G] = {(A, gap-A), (gap-A, gap-A), (gap-A, gap-A), (gap-A, G)}

Page 19: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Microindels

Requires 24 more states:
- (X, gapX) x 4
- (gapX, gapX) x 4
- (gapX, Y) x 16
- Total (incl. orig.): 40 states

Page 20: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Heterozygous SNPs

• For diploid samples, each state has to account for heterozygous differences
• Each state in VARiD’s HMM is a unique combination of two of the original 40 states (obtained by S x S)
• 40² = 1600 states!
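The state counts can be checked by enumeration. This sketch only counts states. The tuple representation and the `gap-X` labels are illustrative, and it says nothing about how VARiD assigns transitions or emissions to the gap states.

```python
from itertools import product

BASES = "ACGT"
GAPS = ["gap-" + base for base in BASES]  # one dummy gap nucleotide per base

letter_states = list(product(BASES, repeat=2))          # (X, Y): 16
gap_open = [(base, "gap-" + base) for base in BASES]    # (X, gapX): 4
gap_extend = [(gap, gap) for gap in GAPS]               # (gapX, gapX): 4
gap_close = list(product(GAPS, BASES))                  # (gapX, Y): 16

haploid_states = letter_states + gap_open + gap_extend + gap_close

# Diploid model: every ordered pair of haploid states (S x S).
diploid_states = list(product(haploid_states, repeat=2))

print(len(haploid_states), len(diploid_states))
```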

Page 21: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Features

• Keeps track of quality scores and positions within a read to augment HMM error rates (ε, ξ) for greater accuracy

• Post-processing ensures that all heterozygous SNP calls are supported by enough reads
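The slides do not say how quality scores are folded into ε and ξ. One plausible approach, shown here purely as an assumption, converts a Phred quality score Q into an error probability 10^(-Q/10) and splits it evenly over the three wrong symbols, so that the correct symbol keeps probability 1 - 3ε.

```python
def error_prob_from_phred(quality):
    """Standard Phred definition: P(error) = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def per_symbol_error(quality):
    """Hypothetical mapping: spread the error evenly over the 3 wrong symbols."""
    return error_prob_from_phred(quality) / 3


for q in (10, 20, 30):
    eps = per_symbol_error(q)
    print(q, eps, 1 - 3 * eps)
```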

Page 22: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Features

Source: Original paper

Page 23: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Features

• First T in a read is NOT part of the sequence.

Page 24: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Features

• First T is NOT part of the genotype!

• VARiD eliminates linker remnant without having to translate fully

Page 25: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

VALiDation

• 260kb from the human genome
• Sequenced with ABI SOLiD and 454/Roche
• Reference obtained through Sanger reads
• Artificial datasets created with varying amounts of coverage
• Tested in colour-space alone (against Corona), letter-space alone (against gigaBayes) with various aligners and with a combination of data

Page 26: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

VALiDation

Measures:

• True Positives (correctly identified SNPs)
• False Positives (SNPs not in Sanger set)
• Precision (TP as fraction of all predictions)
• Recall (TP as fraction of Sanger set SNPs)
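These four measures can be computed from two sets of SNP positions. The positions below are made up for illustration and are not results from the paper.

```python
def evaluate(predicted, sanger):
    """Compare predicted SNP positions against the Sanger-derived truth set."""
    true_pos = len(predicted & sanger)
    false_pos = len(predicted - sanger)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(sanger) if sanger else 0.0
    return true_pos, false_pos, precision, recall


sanger_snps = {100, 250, 400, 975}   # hypothetical truth set
predicted_snps = {100, 250, 600}     # hypothetical caller output

tp, fp, precision, recall = evaluate(predicted_snps, sanger_snps)
print(tp, fp, precision, recall)
```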

Page 27: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

VALiDation

Colour space only

In colour space, VARiD had slightly higher precision than the Corona caller on AB-mapped reads, with comparable but slightly lower recall.

Using VARiD with SHRiMP produced a higher recall rate, but a lower precision when compared to VARiD + AB mapper.

(no significance statistics were presented)

Page 28: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

VALiDation

Letter space only

In letter space, gigaBayes + mosaik performed better than VARiD (using the same mosaik mapper) at low coverage, but fell behind at higher coverage.

VARiD + SHRiMP did better than VARiD + mosaik in both low and high coverage, and clearly outperformed gigaBayes at 20x coverage

Page 29: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

VALiDation

Mixed space

VARiD’s true strength lies in being able to combine colour- and letter-space reads and to perform better on them than on cost-equivalent letter-only or colour-only data:

Page 30: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Issues

• No statistical significance presented on performance improvement
• Experimental size relatively small (260kb)
• Not ideal for low coverage data
• Would be interesting to see how VARiD performs on more diverse data sets (more/fewer SNPs, indels, etc.)

Page 31: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

Issues

• No statistical significance presented on performance improvement
• Experimental size relatively small (260kb)
• Not ideal for low coverage data
• Would be interesting to see how VARiD performs on more diverse data sets (more/fewer SNPs, indels, etc.)

• Any more?

Page 32: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

The End.

Page 33: VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

References

• Dalca, A.V., Rumble, S.M., Levy, S., Brudno, M. VARiD: A Variation Detection Framework for Color-space and Letter-space platforms. 2010 (in progress).

• Dalca, A.V. & Brudno, M. VARiD: Variation Detection in Color-space and Letter-space (poster).

• Hidden Markov model. (2010, February 2). In Wikipedia, The Free Encyclopedia. Retrieved 13:24, February 10, 2010, from http://en.wikipedia.org/w/index.php?title=Hidden_Markov_model&oldid=341442380

• Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A. and Brudno, M. (2009) SHRiMP: Accurate mapping of short color-space reads. PLoS Comput Biol.