bi820 2005 s marth poly bayes
General methods of SNP discovery: PolyBayes
Gabor T. Marth
Department of Biology, Boston College, Chestnut Hill, MA 02467
General methods of SNP mining – PolyBayes
Two innovative ideas:
1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources
2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors
[Figure labels: sequencing error; true polymorphism]
Computational SNP mining – PolyBayes
sequence clustering simplifies to database search with genome reference
paralog filtering by counting mismatches weighted by quality values
multiple alignment by anchoring fragments to genome reference
SNP detection by differentiating true polymorphism from sequencing error using quality values
Sequence clustering
• Clustering simplifies to search against sequence database to recruit relevant sequences
[Figure: fragments grouped into cluster 1, cluster 2, cluster 3 along the genome reference]
• Clusters = groups of overlapping sequence fragments matching the genome reference
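The clustering idea on this slide can be sketched in a few lines. This is only an illustration, not the PolyBayes code: it assumes the database search has already produced one hit per fragment, and the `(fragment id, start, end)` tuple layout and the name `cluster_hits` are hypothetical. Coordinates are taken to be inclusive positions on the genome reference.

```python
from typing import List, Tuple

# Hypothetical hit record: (fragment id, start on reference, end on reference),
# with inclusive coordinates.
Hit = Tuple[str, int, int]


def cluster_hits(hits: List[Hit]) -> List[List[str]]:
    """Group fragments whose reference intervals overlap, directly or via a chain."""
    clusters: List[List[str]] = []
    current: List[str] = []
    current_end = -1
    # Sweep the hits in order of their start position on the reference.
    for frag_id, start, end in sorted(hits, key=lambda h: h[1]):
        # A gap between this hit and everything seen so far closes the cluster.
        if current and start > current_end:
            clusters.append(current)
            current = []
        current.append(frag_id)
        current_end = max(current_end, end)
    if current:
        clusters.append(current)
    return clusters


hits = [("f1", 0, 100), ("f2", 50, 150), ("f3", 300, 400),
        ("f4", 390, 450), ("f5", 1000, 1100)]
clusters = cluster_hits(hits)
```

Because every fragment is placed on the same reference coordinate system, a single sorted sweep is enough and no fragment-against-fragment comparison is needed.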
(Anchored) multiple alignment
• Advantages
• efficient -- only involves pair-wise comparisons
• accurate -- correctly aligns alternatively spliced ESTs
• The genomic reference sequence serves as an anchor
• fragments pair-wise aligned to genomic sequence
• insertions are propagated – “sequence padding”
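A minimal sketch of "sequence padding" follows, under simplifying assumptions that are not from the slides: every fragment has already been aligned pair-wise to the whole reference, each alignment is given as two equal-length gapped strings with `-` as the gap character, and the function name `pad_alignments` is hypothetical. Inserted bases that land in the same slot are simply left-justified, not realigned against each other.

```python
from typing import List, Tuple


def pad_alignments(pairs: List[Tuple[str, str]]) -> List[str]:
    """Merge pair-wise (reference, fragment) alignments into one padded alignment.

    Returns the padded reference followed by one padded row per fragment.
    """
    ref = pairs[0][0].replace("-", "")
    n = len(ref)
    parsed = []
    for ref_aln, frag_aln in pairs:
        ins, bases, pending = [], [], ""
        for r, f in zip(ref_aln, frag_aln):
            if r == "-":          # fragment base inserted relative to the reference
                pending += f
            else:                 # column aligned to a reference base
                ins.append(pending)
                bases.append(f)
                pending = ""
        ins.append(pending)       # insertion after the last reference base
        parsed.append((ins, bases))

    # Widest insertion seen before each reference position (and after the last).
    width = [max(len(p[0][k]) for p in parsed) for k in range(n + 1)]

    def build(ins: List[str], bases: List[str]) -> str:
        out = []
        for k in range(n):
            out.append(ins[k].ljust(width[k], "-"))  # pad shorter insertions
            out.append(bases[k])
        out.append(ins[n].ljust(width[n], "-"))
        return "".join(out)

    padded_ref = build([""] * (n + 1), list(ref))
    return [padded_ref] + [build(i, b) for i, b in parsed]


rows = pad_alignments([("AC-GT", "ACTGT"),   # fragment with an inserted T
                       ("ACGT", "A-GT")])    # fragment with a deleted C
```

Here the insertion carried by the first fragment opens a pad column in the reference and in the second fragment, which is how only pair-wise comparisons can still yield a consistent multiple alignment.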
Paralog filtering
• The “paralog problem”
• unrecognized paralogs give rise to spurious SNP predictions
• SNPs in duplicated regions may be useless for genotyping
[Figure labels: paralogous difference; sequencing errors]
• Challenge
• to differentiate between sequencing errors and paralogous difference
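One way to picture "counting mismatches weighted by quality values" is sketched below. It is a simplification for illustration, not the actual PolyBayes paralog test: each mismatch is weighted by the probability that its base call is correct, using the standard Phred relation (error probability = 10^(-Q/10)), and the weighted count per aligned base is compared with a cutoff. The function names and the `max_rate` default are assumptions, and the inputs are taken to be ungapped and of equal length.

```python
from typing import Sequence


def phred_to_error(q: float) -> float:
    """Standard Phred relation: probability that a base call is wrong."""
    return 10 ** (-q / 10)


def weighted_mismatches(ref: str, frag: str, quals: Sequence[float]) -> float:
    """Sum over mismatches of the probability that the fragment base call is correct."""
    total = 0.0
    for r, f, q in zip(ref, frag, quals):
        if r != f:
            total += 1.0 - phred_to_error(q)
    return total


def looks_paralogous(ref: str, frag: str, quals: Sequence[float],
                     max_rate: float = 0.01) -> bool:
    """Flag a fragment whose quality-weighted mismatch rate exceeds max_rate.

    max_rate is an illustrative cutoff, not a value taken from the slides.
    """
    return weighted_mismatches(ref, frag, quals) / len(ref) > max_rate


ref = "ACGTACGTAC" * 100              # 1000 reference bases
one_diff = "T" + ref[1:]              # a single mismatch at position 0
many_diff = ref.replace("ACGT", "ACGA")  # one mismatch in every 10-base repeat unit
quals = [40] * len(ref)
```

A high-quality mismatch counts as almost one full difference, while a low-quality one contributes much less, so a fragment is not rejected merely because it contains poorly sequenced bases.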
SNP detection
• Goal: to discern true variation from sequencing error
[Figure labels: sequencing error; polymorphism; genome reference sequence]
SNP discovery with PolyBayes
1. Fragment recruitment (database search)
2. Anchored alignment
3. Paralog identification
4. SNP detection
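$$
P(\mathrm{SNP}) =
\frac{\displaystyle\sum_{\text{all variable } S}
      \frac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots
      \frac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \cdot
      P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
     {\displaystyle\sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
      \frac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots
      \frac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \cdot
      P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$
• P(SNP): probability of polymorphism
• P(S_i | R_i): base call, base quality
• P_Prior(S_1, ..., S_N): a priori polymorphism rate
• P_Prior(S_i): base composition
• N: depth of coverage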
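The Bayesian calculation on this slide can be tried out on a single alignment column with the brute-force sketch below. Several pieces are assumptions made only for illustration and are not taken from the slides: a called base with error probability e is given probability 1 - e and each other base e/3, the base composition prior is uniform (0.25), monomorphic columns share prior mass 1 - `prior_snp` equally, all variable columns share `prior_snp` equally, and the default `prior_snp` value is arbitrary. The enumeration visits all 4^N assignments, so it is only practical for small depth of coverage.

```python
from itertools import product
from typing import Dict, Sequence

BASES = "ACGT"
BASE_PRIOR = 0.25  # assumed uniform base composition


def base_likelihood(called: str, qual: float) -> Dict[str, float]:
    """P(S | R) for one read: called base gets 1 - e, each other base e / 3."""
    err = 10 ** (-qual / 10)  # Phred quality to error probability
    return {b: (1 - err) if b == called else err / 3 for b in BASES}


def snp_probability(calls: str, quals: Sequence[float],
                    prior_snp: float = 0.003) -> float:
    """Posterior probability that one alignment column is polymorphic."""
    n = len(calls)
    if n < 2:
        return 0.0  # a single read cannot show variation within the column
    per_read = [base_likelihood(c, q) for c, q in zip(calls, quals)]
    # Simplified joint prior (an assumption of this sketch).
    prior_mono = (1 - prior_snp) / 4          # each of the 4 monomorphic columns
    prior_poly = prior_snp / (4 ** n - 4)     # each variable column
    numerator = denominator = 0.0
    for assignment in product(BASES, repeat=n):
        term = 1.0
        for lik, s in zip(per_read, assignment):
            term *= lik[s] / BASE_PRIOR
        variable = len(set(assignment)) > 1
        term *= prior_poly if variable else prior_mono
        denominator += term                   # all assignments
        if variable:
            numerator += term                 # variable assignments only
    return numerator / denominator


p_clean = snp_probability("AAAA", [40, 40, 40, 40])
p_snp = snp_probability("AAGG", [40, 40, 40, 40])
p_low_quality = snp_probability("AAGG", [40, 40, 5, 5])
```

With these assumptions, two high-quality G calls against two high-quality A calls give a probability close to 1, while the same calls with low-quality G bases score far lower, which is the behaviour the quality-aware formula is meant to produce.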
1. The algorithm
Bayesian-statistical SNP detection