Sequence Alignment by Information Compression
Posted on 09-May-2015
Sequence Alignment by Information Compression Nacho Caballero
Alignment by Compression
Probability and Information
Traditional Alignments
Traditional Alignments
Traditional alignments can’t handle low complexity regions
AAGCAGAATTTAACATGTGGTTTGCTCATTTGTTCTTTATCGCATCTTTTGAAAACGCTATCGAAATAGCAGTACCTTCAGACTTTTTCCGAATACAGTTTAGCCAAAAATATCAAGAAAAGCTTGAGCGCAAGTTCCTCGAACTTTCTGGACACCCCATTAAACTTTTGTTTGCCGTTAAAAAAGGTACTTATCT !
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN !
50% of the human genome is masked
Traditional scoring schemes don’t reflect sequence bias
GC content
GC skew
Match +8, Mismatch -4, Gap -3
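To make the point about fixed scoring concrete, here is a minimal Python sketch. It assumes the usual definitions of GC content, (G + C) / length, and GC skew, (G - C) / (G + C), and it treats the slide's gap value as a simple per-position penalty, which the slide does not specify. The function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq):
    """(G - C) / (G + C); returns 0.0 when the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def fixed_score(a, b, match=8, mismatch=-4, gap=-3):
    """Score two already-aligned strings of equal length ('-' marks a gap).

    Assumption: a linear per-position gap penalty.
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# Two perfect self-alignments with very different composition
# receive exactly the same score under the fixed scheme.
at_rich = fixed_score("AAAATTTT", "AAAATTTT")
balanced = fixed_score("ACGTACGT", "ACGTACGT")
```

Both self-alignments score the same even though one sequence has no G or C at all, so the scheme is blind to how surprising each match is.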
Traditional alignments lack an objective function to measure quality
Probability and Information
Information and probability are two sides of the same coin
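$I(\text{event}) = \log_2 \frac{1}{p(\text{event})} = -\log_2 p(\text{event})$
Figure: Information (bits) against the probability that the event occurs. Marked points: p = .5 gives 1 bit, p = .25 gives 2 bits, p = .05 gives 4.3 bits.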
Maximum in DNA: 2 bits per base (p = .25 for each of the four bases)
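The relation between probability and information can be checked numerically. The short Python sketch below evaluates I(event) = log2(1 / p(event)) at the probabilities marked on the slide; the function name is hypothetical.

```python
import math


def information_bits(p):
    """Self-information of an event with probability p, in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)


# Probabilities marked on the slide, plus a certain event (p = 1).
for p in (1.0, 0.5, 0.25, 0.05):
    print(f"p = {p}: {information_bits(p):.1f} bits")
```

A certain event carries no information, and a base drawn uniformly from four carries 2 bits.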
A: 0 bits
AAAAAAAAAAAAAAAA…
AAAAAAATTTTTTTTT…
ATGCACTACTAACGGA…
Compression encodes symbols using a probability distribution
AAAAAACGGG
00000000000001101010 (20 bits, 2 bits per base: A = 00, C = 01, G = 10)
11011010 (8 bits)
Figure: code trees over the symbols C, A, G, T.
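As a rough illustration of how a probability distribution determines code length, the Python sketch below computes the ideal cost of the slide's example sequence under two models: a uniform one, and a static model that uses the sequence's own base frequencies. The second model is an assumption chosen for illustration; the slides do not say which model produces their shorter bit string, and the ideal cost ignores the overhead of transmitting the model itself.

```python
import math
from collections import Counter


def code_length_bits(seq, probs):
    """Ideal code length: the sum of log2(1 / p) over every symbol."""
    return sum(math.log2(1.0 / probs[base]) for base in seq)


seq = "AAAAAACGGG"

# Uniform model: every base costs exactly 2 bits.
uniform = {base: 0.25 for base in "ACGT"}

# Static order-0 model built from the sequence's own base frequencies
# (an illustrative assumption, not the model used on the slide).
counts = Counter(seq)
empirical = {base: n / len(seq) for base, n in counts.items()}

uniform_bits = code_length_bits(seq, uniform)
empirical_bits = code_length_bits(seq, empirical)
```

The uniform model needs 20 bits, matching the fixed 2-bit encoding, while the frequency-based model needs about 13 bits because the common symbol A becomes cheap.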
Alignment by Compression
Homologous sequences share information
Figure: probability distributions over C, A, G, T from a Markov Expert and an Align Expert.
CCGAATCATGTC CGAATCATTTAGCCAAAAAT…
TAGTAACAGTTTCCGAATCAAGCCAAAAAT
Markov Expert: I(Query)
Align Expert: I(Query | Reference)
Mutual Information = I(Query) - I(Query | Reference)
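The idea that a reference can make a query cheaper to encode can be sketched in Python. This is not XMAligner's algorithm: it is a simplified stand-in that approximates I(Query | Reference) by priming an adaptive order-2 Markov model with counts from the reference, whereas a real align expert would copy predictions from aligned positions. The function name, the model order, the add-one smoothing, and the toy sequences are all assumptions.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def adaptive_bits(seq, order=2, prime=""):
    """Bits needed to encode seq with an adaptive order-k Markov model.

    Counts start at 1 for every base (add-one smoothing). If `prime` is
    given, its context counts are loaded first, which crudely stands in
    for conditioning on a reference. Only the symbols A, C, G, T are
    supported.
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 1))

    # Load context -> next-base counts from the priming sequence.
    for i in range(order, len(prime)):
        counts[prime[i - order:i]][prime[i]] += 1

    bits = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]
        table = counts[context]
        p = table[base] / sum(table.values())
        bits += math.log2(1.0 / p)
        table[base] += 1  # learn from the symbol just encoded
    return bits


# Toy sequences (hypothetical, chosen so the reference is informative).
reference = "ACGTACGTACGTACGT"
query = "ACGTACGTACGT"

i_query = adaptive_bits(query)
i_query_given_ref = adaptive_bits(query, prime=reference)
mutual_bits = i_query - i_query_given_ref
```

When the reference shares structure with the query, as in this toy pair, the primed model spends fewer bits and `mutual_bits` is positive; an unrelated reference could leave it near zero or even negative in this crude approximation.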
XMAligner wins on distantly related biased sequences
Figure: plot with axes Specificity and Sensitivity.
XMAligner is the most sensitive at detecting exons
XMAligner detecting a gene cluster
Figure: Plasmodium gene cluster.
Alignment by compression overcomes the limitations of traditional alignment, producing better results in distantly related or biased sequences