Sequence Alignment by Information Compression

Post on 09-May-2015

DESCRIPTION

A presentation based on Minh Cao's 2010 paper "A genome alignment algorithm based on compression"

TRANSCRIPT

Sequence Alignment by Information Compression

Nacho Caballero

Alignment by Compression

Probability and Information

Traditional Alignments

Traditional Alignments

Traditional alignments can’t handle low complexity regions

AAGCAGAATTTAACATGTGGTTTGCTCATTTGTTCTTTATCGCATCTTTTGAAAACGCTATCGAAATAGCAGTACCTTCAGACTTTTTCCGAATACAGTTTAGCCAAAAATATCAAGAAAAGCTTGAGCGCAAGTTCCTCGAACTTTCTGGACACCCCATTAAACTTTTGTTTGCCGTTAAAAAAGGTACTTATCT !

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN !

50% of the human genome is masked

Traditional scoring schemes don’t reflect sequence bias

GC content

GC skew

Match: +8, Mismatch: -4, Gap: -3
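As a rough illustration of the two ideas on this slide, the sketch below computes GC content and GC skew for a sequence and scores a pair of aligned rows with the fixed values +8 / -4 / -3. The function names, the use of `-` as the gap character, and the choice to charge the gap penalty once per gap column are assumptions made for this example; the slides do not say how gaps are counted. GC skew is taken in its usual form, (G - C) / (G + C).

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq: str) -> float:
    """(G - C) / (G + C); returns 0.0 when seq contains no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def fixed_score(row_a: str, row_b: str, match: int = 8,
                mismatch: int = -4, gap: int = -3) -> int:
    """Score two equal-length alignment rows; '-' marks a gap (assumed)."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap        # assumed: one penalty per gap column
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


# The fixed scheme ignores composition: both pairs get the same score,
# even though a run of A's is far less surprising in an AT-rich genome.
print(fixed_score("AAAA", "AAAA"), fixed_score("ACGT", "ACGT"))
```

Both calls score four matches identically, which is the sense in which a fixed scheme does not reflect sequence bias.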

Traditional alignments lack an objective function to measure quality

Probability and Information

Information and probability are two sides of the same coin

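$$I(\text{event}) = \log_2 \frac{1}{p(\text{event})} = -\log_2 p(\text{event})$$

[Figure: information (bits) plotted against the probability that the event occurs. Marked points: p = .5, 1 bit; p = .25, 2 bits; p = .05, 4.3 bits.]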
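The relation between probability and information, I(event) = log2(1 / p(event)), can be checked numerically. The short sketch below (the function name `information_bits` is chosen for this example) evaluates it at the probabilities marked on the slide.

```python
import math


def information_bits(p: float) -> float:
    """Information, in bits, of an event that occurs with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)


for p in (1.0, 0.5, 0.25, 0.05):
    print(f"p = {p}: {information_bits(p):.1f} bits")
```

A certain event carries no information, and each halving of the probability adds one bit, so p = .05 works out to about 4.3 bits as on the slide.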

[Figure: the same information versus probability plot, with the point p = .25, 2 bits marked as the maximum in DNA.]
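One way to read "Maximum in DNA": with four bases, the average information per base is largest when all four are equally likely, and that maximum is 2 bits. The sketch below computes the average (Shannon entropy) for a uniform distribution and for a biased one. The biased probabilities are made up for illustration and are not taken from the slides.

```python
import math


def entropy_bits(probs: dict) -> float:
    """Average information per symbol, in bits, for a distribution."""
    return sum(p * math.log2(1.0 / p) for p in probs.values() if p > 0)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}   # hypothetical bias

print(entropy_bits(uniform))   # the four-base maximum
print(entropy_bits(at_rich))   # lower: biased sequences carry less per base
```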

[Figure: example sequences AAAAAAAAAAAAAAAA…, AAAAAAATTTTTTTTT… and ATGCACTACTAACGGA…; a repeated A is labeled 0 bits.]

Compression encodes symbols using a probability distribution

[Figure: the bit string 00000000000001101010 11011010 shown with the sequence AAAAAACGGG and probabilities over the bases C, A, G, T.]
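To make the link between compression and probability concrete, the sketch below adds up log2(1 / p) for each base under a very simple adaptive model: base counts that start at one and are updated as the sequence is read. This model is an assumption chosen for brevity, not the model used in the paper. The total is the ideal code length that an entropy coder driven by the same probabilities would approach.

```python
import math

ALPHABET = "ACGT"


def adaptive_code_length(seq: str) -> float:
    """Ideal code length of seq, in bits, under an adaptive count model."""
    counts = {base: 1 for base in ALPHABET}   # start every base at one
    total_bits = 0.0
    for base in seq:
        p = counts[base] / sum(counts.values())   # prediction before seeing base
        total_bits += math.log2(1.0 / p)
        counts[base] += 1                         # learn from the base just seen
    return total_bits


for seq in ("AAAAAAAAAA", "AAAAAACGGG", "ACGTACGTAC"):
    print(seq, round(adaptive_code_length(seq), 2), "bits")
```

The more predictable the sequence is to the model, the fewer bits it costs. A sequence that this simple model cannot predict costs at least the 2 bits per base of a plain fixed-length encoding.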

Alignment by Compression

Homologous sequences share information

[Figure, built up over three slides: the sequences CCGAATCATGTC, CGAATCATTTAGCCAAAAAT… and TAGTAACAGTTTCCGAATCAAGCCAAAAAT, each with probabilities over the bases C, A, G, T.]

Quantities shown: I(Query), I(Query | Reference), Mutual Information

Experts shown: Markov Expert, Align Expert
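The quantities on these slides can be sketched in a few lines under heavy simplification. Here I(Query) is the code length of the query under an adaptive Markov model alone. I(Query | Reference) is the code length when that prediction is blended with an "align expert" that proposes the reference base at a fixed offset. The mutual information is taken as the difference between the two. The fixed offset, the blending weight, the expert's confidence and the toy sequences are all assumptions for this example; they are not the paper's algorithm, which the slides do not spell out.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def markov_probs(counts, context):
    """Add-one-smoothed distribution of the next base after context."""
    seen = counts[context]
    total = sum(seen.values()) + len(ALPHABET)
    return {b: (seen[b] + 1) / total for b in ALPHABET}


def code_length(query, reference=None, offset=0, order=2,
                align_weight=0.5, align_conf=0.9):
    """Bits needed to encode query (A/C/G/T only), optionally using reference."""
    counts = defaultdict(lambda: defaultdict(int))
    bits = 0.0
    for i, base in enumerate(query):
        context = query[max(0, i - order):i]
        probs = markov_probs(counts, context)          # Markov expert
        j = i + offset                                 # assumed fixed alignment
        if reference is not None and 0 <= j < len(reference) \
                and reference[j] in ALPHABET:
            rest = (1.0 - align_conf) / (len(ALPHABET) - 1)
            probs = {                                  # blend in the align expert
                b: (1.0 - align_weight) * probs[b]
                + align_weight * (align_conf if b == reference[j] else rest)
                for b in ALPHABET
            }
        bits += math.log2(1.0 / probs[base])
        counts[context][base] += 1
    return bits


query = "ACGTTGCAAGCTTAGC"        # hypothetical toy sequences
reference = "ACGTTGCTAGCTTAGC"    # same as query except for one base
i_query = code_length(query)
i_query_given_ref = code_length(query, reference)
print("mutual information (bits):", round(i_query - i_query_given_ref, 2))
```

When the reference predicts the query well, the conditional code length drops and the difference is positive. For unrelated sequences the align expert mostly misleads the blend and the difference shrinks or turns negative.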

XMAligner wins on distantly related biased sequences

[Figure: plot with Specificity and Sensitivity axes.]
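For reference, sensitivity and specificity are commonly computed from counts of true and false positives and negatives, as in the sketch below. The slides do not state which definitions the benchmark uses, and some gene-prediction literature uses "specificity" for TP / (TP + FP) instead, so treat these as the textbook forms only. The counts in the usage lines are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of real positives that were detected: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no real positives")
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of real negatives that were rejected: TN / (TN + FP)."""
    if tn + fp == 0:
        raise ValueError("no real negatives")
    return tn / (tn + fp)


# Hypothetical counts, for illustration only.
print(sensitivity(tp=8, fn=2), specificity(tn=9, fp=1))
```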

XMAligner is the most sensitive at detecting exons

XMAligner detecting a gene cluster

[Figure: Plasmodium gene cluster]

Alignment by compression overcomes the limitations of traditional alignment, producing better results on distantly related or biased sequences
