Page 1: Bi820 2005 S Marth Poly Bayes

General methods of SNP discovery: PolyBayes

Gabor T. Marth

Department of Biology, Boston College, Chestnut Hill, MA 02467

Page 2: Bi820 2005 S Marth Poly Bayes

General methods of SNP mining – PolyBayes

Two innovative ideas:

1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources

2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors

[Figure: sequencing error vs. true polymorphism]
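The second idea relies on turning a base quality value into an error probability. A minimal sketch, assuming the quality values are Phred-scaled (the slides only say "base quality values", so the scale is an assumption, and the function names are illustrative):

```python
def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong.

    Assumes the Phred convention Q = -10 * log10(P_error).
    """
    return 10 ** (-q / 10)


def expected_errors(quals) -> float:
    """Expected number of miscalled bases in a read, given its quality values."""
    return sum(phred_to_error_prob(q) for q in quals)
```

Under this convention a mismatch at a base with quality 40 is wrong about once in 10,000 calls, so it is far more likely to be a true difference than a mismatch at quality 10.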

Page 3: Bi820 2005 S Marth Poly Bayes

Computational SNP mining – PolyBayes

sequence clustering simplifies to database search with genome reference

paralog filtering by counting mismatches weighted by quality values

multiple alignment by anchoring fragments to genome reference

SNP detection by differentiating true polymorphism from sequencing error using quality values

Page 4: Bi820 2005 S Marth Poly Bayes

Sequence clustering

• Clustering simplifies to search against sequence database to recruit relevant sequences

[Figure: fragments aligned along the genome reference, grouped into cluster 1, cluster 2, cluster 3]

• Clusters = groups of overlapping sequence fragments matching the genome reference
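Once the database search has placed each fragment on the reference, grouping overlapping fragments is a simple interval sweep. The sketch below assumes each search hit has already been reduced to a hypothetical `(fragment_id, start, end)` tuple with half-open reference coordinates. It illustrates the idea and is not the PolyBayes implementation.

```python
from typing import List, Tuple

# (fragment id, start, end) on the genome reference; end is exclusive.
Hit = Tuple[str, int, int]


def cluster_hits(hits: List[Hit]) -> List[List[str]]:
    """Group fragments whose reference intervals overlap into clusters."""
    clusters: List[List[str]] = []
    current: List[str] = []
    current_end = 0
    for frag_id, start, end in sorted(hits, key=lambda h: (h[1], h[2])):
        if current and start < current_end:
            # Overlaps the cluster being built: extend it.
            current.append(frag_id)
            current_end = max(current_end, end)
        else:
            # No overlap: close the previous cluster and start a new one.
            if current:
                clusters.append(current)
            current = [frag_id]
            current_end = end
    if current:
        clusters.append(current)
    return clusters
```

For example, hits at 0-100, 50-150, and 140-200 end up in one cluster, while hits at 300-400 and 350-420 form a second one.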

Page 5: Bi820 2005 S Marth Poly Bayes

(Anchored) multiple alignment

• The genomic reference sequence serves as an anchor

• fragments are aligned pair-wise to the genomic sequence

• insertions are propagated – “sequence padding”

• Advantages

• efficient -- only involves pair-wise comparisons

• accurate -- correctly aligns alternatively spliced ESTs
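Sequence padding can be sketched as follows. The example assumes each fragment arrives as a global pair-wise alignment against the same reference, given as two equal-length gapped strings with `-` for gaps. That input format and the function name `anchored_msa` are assumptions made for illustration. Wherever any fragment has an insertion, the reference and all other fragments receive pad characters at that position.

```python
from typing import List, Tuple


def anchored_msa(ref: str, pairwise: List[Tuple[str, str]]) -> Tuple[str, List[str]]:
    """Merge pair-wise (reference, fragment) alignments into one multiple alignment."""
    n = len(ref)
    all_ins, all_cols = [], []
    for ref_aln, frag_aln in pairwise:
        assert ref_aln.replace("-", "") == ref, "alignment must span the reference"
        ins = [""] * (n + 1)   # inserted bases before reference position i
        cols = [""] * n        # fragment character aligned to reference position i
        i = 0
        for r, f in zip(ref_aln, frag_aln):
            if r == "-":       # gap in reference: fragment insertion
                ins[i] += f
            else:
                cols[i] = f
                i += 1
        all_ins.append(ins)
        all_cols.append(cols)

    # Widest insertion seen at each position decides how much padding is needed.
    width = [max((len(ins[i]) for ins in all_ins), default=0) for i in range(n + 1)]

    def build(cols, ins) -> str:
        out = []
        for i in range(n):
            out.append(ins[i].ljust(width[i], "-"))  # pad shorter insertions
            out.append(cols[i])
        out.append(ins[n].ljust(width[n], "-"))
        return "".join(out)

    padded_ref = build(list(ref), [""] * (n + 1))
    rows = [build(c, s) for c, s in zip(all_cols, all_ins)]
    return padded_ref, rows
```

Only pair-wise alignments are ever computed; the multiple alignment falls out of the shared reference coordinates, which is the efficiency advantage listed on the slide.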

Page 6: Bi820 2005 S Marth Poly Bayes

Paralog filtering

• The “paralog problem”

• unrecognized paralogs give rise to spurious SNP predictions

• SNPs in duplicated regions may be useless for genotyping

• Challenge

• to differentiate between sequencing errors and paralogous difference

[Figure: paralogous difference vs. sequencing errors]
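One simplified way to weight mismatches by quality is to count each mismatch by the probability that its base call is correct, and then compare the resulting rate with a cutoff. The sketch below assumes Phred-scaled qualities and a fragment already aligned to the reference without gaps. The cutoff `max_rate` is an arbitrary illustrative value, and the whole test is a simplified stand-in for the PolyBayes paralog discrimination, not a description of it.

```python
def weighted_mismatches(ref: str, frag: str, quals) -> float:
    """Sum, over mismatching positions, of the probability that the call is correct."""
    return sum(
        1 - 10 ** (-q / 10)          # Phred quality -> P(call is correct)
        for r, f, q in zip(ref, frag, quals)
        if r != f
    )


def looks_paralogous(ref: str, frag: str, quals, max_rate: float = 0.01) -> bool:
    """Flag a fragment whose quality-weighted mismatch rate exceeds max_rate.

    max_rate is a hypothetical threshold chosen for illustration only.
    """
    if not ref:
        return False
    return weighted_mismatches(ref, frag, quals) / len(ref) > max_rate
```

With this weighting, five mismatches at quality 40 in a 200-base fragment give a rate of about 0.025 and are flagged, while the same five mismatches at quality 1 give about 0.005 and are attributed to sequencing error.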

Page 7: Bi820 2005 S Marth Poly Bayes

SNP detection

• Goal: to discern true variation from sequencing error

[Figure: sequencing error vs. polymorphism]

Page 8: Bi820 2005 S Marth Poly Bayes

SNP discovery with PolyBayes

1. Fragment recruitment (database search)

2. Anchored alignment

3. Paralog identification

4. SNP detection

[Figure: the four steps applied to fragments along the genome reference sequence]

Page 9: Bi820 2005 S Marth Poly Bayes

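Bayesian-statistical SNP detection

1. The algorithm

$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \cdot P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
{\displaystyle\sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \cdot P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$

Terms of the formula:

• P(SNP): probability of polymorphism

• P(S_i | R_i): base call, base quality

• P_Prior(S_1, ..., S_N): a priori polymorphism rate

• P_Prior(S_i): base composition

• N: depth of coverage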
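The calculation for a single alignment column can be sketched by brute-force enumeration of all 4^N base configurations, which is practical only for shallow coverage. Several pieces here are assumptions made for illustration: Phred-scaled qualities, an error spread evenly over the three other bases, and a joint prior that gives monomorphic columns a total weight of 1 - theta (split by base composition) and spreads theta uniformly over all variable configurations. The real PolyBayes prior is more detailed; this only shows how the terms combine.

```python
from itertools import product

BASES = "ACGT"


def snp_probability(calls, quals, theta=0.001, composition=None):
    """Posterior probability that a column of N base calls is polymorphic.

    calls: called bases, one per read; quals: matching Phred qualities.
    theta: a priori polymorphism rate (illustrative default).
    composition: P_Prior(S_i) per base; uniform if not given.
    """
    comp = composition or {b: 0.25 for b in BASES}
    n = len(calls)
    if n < 2:
        return 0.0                      # one read cannot show variation
    n_variable = 4 ** n - 4             # configurations with >1 distinct base

    def ratio(true_base, call, q):
        # P(S_i | R_i) / P_Prior(S_i)
        e = 10 ** (-q / 10)
        p = 1 - e if true_base == call else e / 3
        return p / comp[true_base]

    def joint_prior(config):
        # P_Prior(S_1, ..., S_N) under the simplified prior described above
        if len(set(config)) == 1:
            return (1 - theta) * comp[config[0]]
        return theta / n_variable

    total = 0.0      # denominator: sum over all configurations
    variable = 0.0   # numerator: sum over variable configurations only
    for config in product(BASES, repeat=n):
        term = joint_prior(config)
        for s, c, q in zip(config, calls, quals):
            term *= ratio(s, c, q)
        total += term
        if len(set(config)) > 1:
            variable += term
    return variable / total
```

Under these assumptions, two high-quality A calls and two high-quality G calls give a probability close to 1, whereas a single low-quality G among high-quality A calls gives a probability close to 0, which mirrors the distinction between true polymorphism and sequencing error.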