Sequence Alignment by Information Compression
A presentation based on Minh Cao's 2010 paper "A genome alignment algorithm based on compression"
Sequence Alignment by Information Compression Nacho Caballero
Alignment by Compression
Probability and Information
Traditional Alignments
Traditional Alignments
Traditional alignments can’t handle low complexity regions
AAGCAGAATTTAACATGTGGTTTGCTCATTTGTTCTTTATCGCATCTTTTGAAAACGCTATCGAAATAGCAGTACCTTCAGACTTTTTCCGAATACAGTTTAGCCAAAAATATCAAGAAAAGCTTGAGCGCAAGTTCCTCGAACTTTCTGGACACCCCATTAAACTTTTGTTTGCCGTTAAAAAAGGTACTTATCT !
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN !
50% of the human genome is masked
Traditional scoring schemes don’t reflect sequence bias
GC content
GC skew
Match +8 Mismatch -4 Gap -3
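To make the point about fixed scores concrete, here is a minimal Python sketch. The function names are hypothetical, GC skew is taken as the usual (G - C) / (G + C), and the gap score is assumed to apply per gap column (the slide does not say whether gaps are linear or affine). It shows that a fixed match/mismatch/gap scheme gives an AT-rich match and a GC-rich match the same score, whatever the background composition.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq):
    """(G - C) / (G + C); 0.0 when the sequence has no G or C."""
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def fixed_score(row_a, row_b, match=8, mismatch=-4, gap=-3):
    """Score two equal-length gapped rows column by column.

    Assumption: every gap column costs the same (linear gaps).
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


at_rich = "ATATATTTAA"
gc_rich = "GCGCGGGCCG"

# Very different composition...
print(gc_content(at_rich), gc_content(gc_rich))
# ...but a perfect match scores the same in both cases.
print(fixed_score(at_rich, at_rich), fixed_score(gc_rich, gc_rich))
```

The score depends only on the columns, never on how surprising each base is given the composition of the sequences.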
Traditional alignments lack an objective function to measure quality
Probability and Information
Information and probability are two sides of the same coin
\[ I(\text{event}) = \log_2 \frac{1}{p(\text{event})} = -\log_2 p(\text{event}) \]
[Plot: Information against Probability event occurs]
p = .5: 1 bit
p = .25: 2 bits
p = .05: 4.3 bits
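The relation between probability and information can be checked numerically. This is a small Python sketch (the function name `information` is only illustrative) that evaluates log2(1/p) for the probabilities shown on the slide.

```python
import math


def information(p):
    """Information, in bits, carried by an event of probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1 / p)


for p in (1, 0.5, 0.25, 0.05):
    print(f"p = {p}: {information(p):.1f} bits")
```

A certain event carries no information, and each halving of the probability adds one bit.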
Maximum in DNA: 2 bits per base (four equally likely bases, p = .25 each)
AA
A: 0 bits
AAAAAAAAAAAAAAAA…
AAAAAAATTTTTTTTT…
ATGCACTACTAACGGA…
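As a rough illustration of how the information content differs between these example sequences, the following Python sketch computes the average bits per base from single-base frequencies only. This order-0 measure is an assumption made for simplicity, and the function name is hypothetical. Because it ignores the order of the bases, it treats the run of A's followed by T's as if the two bases were shuffled; a context-aware model would find that sequence more predictable.

```python
import math
from collections import Counter


def entropy_per_base(seq):
    """Average information per base, in bits, from base frequencies alone."""
    n = len(seq)
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())


# Visible prefixes of the three example sequences (the "…" tails are not used).
for seq in ("AAAAAAAAAAAAAAAA", "AAAAAAATTTTTTTTT", "ATGCACTACTAACGGA"):
    print(seq, round(entropy_per_base(seq), 2))
```

The homopolymer costs nothing per base, while the mixed sequence approaches the 2-bit ceiling.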
Compression encodes symbols using a probability distribution
AAAAAACGGG
00000000000001101010
11011010
[Figure: diagram labeled with the symbols A, C, G, T]
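To sketch why a probability distribution matters for compression: an ideal coder spends about -log2 p(symbol) bits on each symbol. The Python snippet below compares a uniform distribution (2 bits per base) with the empirical base frequencies of AAAAAACGGG. The helper name is hypothetical, and the ideal code length is a lower bound that a real coder only approaches; it is not a claim about the specific encoder behind the slide.

```python
import math
from collections import Counter


def ideal_code_length_bits(seq, probs):
    """Sum of -log2 p(symbol) over the sequence, in bits."""
    return sum(-math.log2(probs[s]) for s in seq)


seq = "AAAAAACGGG"

# Uniform model: every base has probability 1/4.
uniform = {base: 0.25 for base in "ACGT"}

# Empirical model: probabilities taken from the sequence itself.
empirical = {base: count / len(seq) for base, count in Counter(seq).items()}

print(ideal_code_length_bits(seq, uniform))    # 2 bits per base
print(ideal_code_length_bits(seq, empirical))  # fewer bits: A is cheap, C is costly
```

The skewed distribution makes the frequent A cheap and the rare C expensive, and the total drops well below the 20 bits of the fixed-width code.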
Alignment by Compression
Homologous sequences share information
CCGAATCATGTC CGAATCATTTAGCCAAAAAT…
TAGTAACAGTTTCCGAATCAAGCCAAAAAT
Markov Expert: I(Query)
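The idea of a Markov expert can be sketched as an adaptive order-k model: it predicts each base from the k bases before it, charges -log2 of the predicted probability, and then updates its counts. The Python code below is a simplified illustration under stated assumptions (order 2, add-one pseudo-counts, a hypothetical function name); it is not the expert model used by XMAligner.

```python
import math
from collections import defaultdict


def markov_information(seq, order=2, alphabet="ACGT"):
    """Total bits needed to encode seq with an adaptive order-k Markov model.

    Assumptions: counts start at 1 for every base in every context
    (add-one smoothing), and seq contains only letters from alphabet.
    """
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 1))
    total_bits = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]   # up to `order` preceding bases
        ctx = counts[context]
        p = ctx[base] / sum(ctx.values())    # predicted probability of this base
        total_bits += -math.log2(p)          # cost of encoding it
        ctx[base] += 1                       # learn from what was just seen
    return total_bits


low_complexity = "A" * 50
mixed = "TAGTAACAGTTTCCGAATCAAGCCAAAAAT"

print(markov_information(low_complexity) / len(low_complexity))  # bits per base
print(markov_information(mixed) / len(mixed))
```

The model quickly learns the repetitive sequence and charges little for it, while the mixed sequence stays close to 2 bits per base.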
Align Expert: I(Query | Reference)
Mutual Information = I(Query) - I(Query | Reference)
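One rough way to see the relation between I(Query), I(Query | Reference) and mutual information is to swap the expert model for an off-the-shelf compressor. The Python sketch below uses zlib as a stand-in, which is an assumption for illustration only. It approximates I(Query) by the compressed size of the query, and I(Query | Reference) by the extra size needed to compress the query after the reference. The sequences are synthetic and the helper names are hypothetical.

```python
import random
import zlib


def compressed_bits(text):
    """Size in bits of the zlib-compressed text (crude stand-in for I(text))."""
    return 8 * len(zlib.compress(text.encode("ascii"), 9))


def conditional_bits(query, reference):
    """Extra bits needed for the query once the reference has been seen."""
    return compressed_bits(reference + query) - compressed_bits(reference)


def mutual_information_bits(query, reference):
    """I(Query) - I(Query | Reference), both estimated with zlib."""
    return compressed_bits(query) - conditional_bits(query, reference)


def mutate(seq, rate, rng):
    """Replace each base with a random base with probability `rate`."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else b for b in seq)


rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))
homolog = mutate(reference, 0.02, rng)                      # synthetic homolog
unrelated = "".join(rng.choice("ACGT") for _ in range(2000))

print(mutual_information_bits(homolog, reference))    # large: shared information
print(mutual_information_bits(unrelated, reference))  # near zero
```

A general-purpose compressor only finds exact repeats, so this sketch would miss the distantly related or biased sequences where the slides report XMAligner doing well.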
XMAligner wins on distantly related biased sequences
[Plot axes: Specificity, Sensitivity]
XMAligner is the most sensitive at detecting exons
XMAligner detecting a Plasmodium gene cluster
Alignment by compression overcomes the limitations of traditional alignment, producing better results in distantly related or biased sequences