Sequence Alignment by Information Compression Nacho Caballero

Upload: nacho-caballero

Post on 09-May-2015

DESCRIPTION

A presentation based on Minh Cao's 2010 paper "A genome alignment algorithm based on compression"

TRANSCRIPT

Page 1: Sequence Alignment by Information Compression

Sequence Alignment by Information Compression Nacho Caballero

Page 2: Sequence Alignment by Information Compression
Page 3: Sequence Alignment by Information Compression

Traditional Alignments

Probability and Information

Alignment by Compression

Page 4: Sequence Alignment by Information Compression

Traditional Alignments

Page 5: Sequence Alignment by Information Compression

Traditional alignments can’t handle low-complexity regions

AAGCAGAATTTAACATGTGGTTTGCTCATTTGTTCTTTATCGCATCTTTTGAAAACGCTATCGAAATAGCAGTACCTTCAGACTTTTTCCGAATACAGTTTAGCCAAAAATATCAAGAAAAGCTTGAGCGCAAGTTCCTCGAACTTTCTGGACACCCCATTAAACTTTTGTTTGCCGTTAAAAAAGGTACTTATCT

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN !

50% of the human genome is masked
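To make the masking step concrete, here is a minimal sketch that replaces low-complexity windows with N. It is an illustration only and not the method of any particular masking tool: the function names, the window length of 12, and the entropy threshold of 1.0 bit are all assumptions chosen for the example.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Zero-order Shannon entropy of a window, in bits per base."""
    n = len(window)
    return sum((c / n) * math.log2(n / c) for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 1.0) -> str:
    """Replace every base covered by a low-entropy window with 'N'.

    window and threshold are hypothetical values picked for illustration.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "N"
    return "".join(masked)


# Toy input: a poly-A run flanked by a more varied sequence.
toy = "ACGTACGTACGT" + "A" * 20 + "ACGTACGTACGT"
print(mask_low_complexity(toy))
```

Once a region is turned into N it carries no signal at all for the aligner, which is the limitation the slide points at.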

Page 6: Sequence Alignment by Information Compression

Traditional scoring schemes don’t reflect sequence bias

GC content

GC skew

Match +8 Mismatch -4 Gap -3
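The quantities on this slide can be sketched in a few lines. The scores are the ones shown above; everything else is an assumption made for illustration: the function names, the common definition of GC skew as (G - C) / (G + C), and a gap penalty charged once per gapped column.

```python
MATCH, MISMATCH, GAP = 8, -4, -3  # scores from the slide


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq: str) -> float:
    """(G - C) / (G + C); returns 0.0 when the sequence has no G or C."""
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def score_alignment(a: str, b: str) -> int:
    """Score two gapped, equal-length rows; '-' marks a gap (assumed linear)."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += GAP
        elif x == y:
            score += MATCH
        else:
            score += MISMATCH
    return score


print(gc_content("GGCC"), gc_skew("GGGC"), score_alignment("ACGT-A", "ACCTGA"))
```

Note that `score_alignment` never looks at `gc_content` or `gc_skew`: a match earns the same +8 in an AT-rich genome as in a balanced one, which is the bias problem the slide describes.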

Page 7: Sequence Alignment by Information Compression

Traditional alignments lack an objective function to measure quality

Page 8: Sequence Alignment by Information Compression

Probability and Information

Page 9: Sequence Alignment by Information Compression

Information and probability are two sides of the same coin

Information

$I(\text{event}) = \log_2 \frac{1}{p(\text{event})} = -\log_2 p(\text{event})$

[Plot: Information vs. Probability event occurs]

p = .5: 1 bit
p = .25: 2 bits
p = .05: 4.3 bits
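As a quick check of the relationship between probability and information, the following sketch evaluates the formula at the probabilities marked on the plot. The function name is an assumption introduced for the example.

```python
import math


def information_bits(p: float) -> float:
    """Information, in bits, of an event that occurs with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)


for p in (1.0, 0.5, 0.25, 0.05):
    print(f"p = {p}: {information_bits(p):.1f} bits")
```

The rarer the event, the more bits it takes to describe it, and a certain event carries no information.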

Page 10: Sequence Alignment by Information Compression

Information and probability are two sides of the same coin

Information

$I(\text{event}) = \log_2 \frac{1}{p(\text{event})} = -\log_2 p(\text{event})$

[Plot: Information vs. Probability event occurs]

p = .5: 1 bit
p = .25: 2 bits (Maximum in DNA)

AAAAAAAAAAAAAAAA…: 0 bits
AAAAAAATTTTTTTTT…
ATGCACTACTAACGGA…
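One way to put numbers on the example sequences is their zero-order entropy, the average information per base when each base is drawn from the sequence's own composition. This is a simplified sketch: the function name is an assumption, and a zero-order estimate ignores base order, so it is only a rough stand-in for what a real compression model would measure.

```python
import math
from collections import Counter


def entropy_bits_per_base(seq: str) -> float:
    """Average information per base under the sequence's own base frequencies."""
    n = len(seq)
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())


for seq in ("AAAAAAAAAAAAAAAA", "AAAAAAATTTTTTTTT", "ATGCACTACTAACGGA"):
    print(seq, round(entropy_bits_per_base(seq), 2))
```

With four equally likely bases the value reaches 2 bits per base, the maximum for DNA.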

Page 11: Sequence Alignment by Information Compression

Compression encodes symbols using a probability distribution

AAAAAACGGG

00000000000001101010
11011010

[Figure labels: C A G T, C A G T]
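The gain from encoding with a probability distribution can be sketched by comparing a fixed 2-bit code with the ideal code length under the sequence's own base frequencies. The mapping A=00, C=01, G=10 reproduces the 20-bit string on the slide; T=11 and the function names are assumptions. The second function reports only the number of bits an ideal coder would need and does not produce an actual bitstring.

```python
import math
from collections import Counter

# Fixed 2-bit code; T=11 is an assumption (T does not occur in the example).
FIXED_CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode_fixed(seq: str) -> str:
    """Encode every base with 2 bits, ignoring how often each base occurs."""
    return "".join(FIXED_CODE[base] for base in seq)


def ideal_code_length_bits(seq: str) -> float:
    """Sum of -log2 p(base) over the sequence, with p taken from base counts."""
    n = len(seq)
    return sum(c * math.log2(n / c) for c in Counter(seq).values())


seq = "AAAAAACGGG"
print(encode_fixed(seq), len(encode_fixed(seq)))
print(round(ideal_code_length_bits(seq), 2))
```

Frequent symbols get short codes and rare symbols get long ones, so a skewed sequence needs fewer bits than the fixed 2 bits per base.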

Page 12: Sequence Alignment by Information Compression

Alignment by Compression

Page 13: Sequence Alignment by Information Compression

Homologous sequences share information

[Figure labels: C A G T]

CCGAATCATGTC CGAATCATTTAGCCAAAAAT…

I(Query)

Markov Expert

TAGTAACAGTTTCCGAATCAAGCCAAAAAT

Page 14: Sequence Alignment by Information Compression

Homologous sequences share information

[Figure labels: C A G T, C A G T]

CCGAATCATGTC CGAATCATTTAGCCAAAAAT…

TAGTAACAGTTTCCGAATCAAGCCAAAAAT

I(Query | Reference)
Mutual Information
I(Query)

Markov Expert

Align Expert

Page 15: Sequence Alignment by Information Compression

Homologous sequences share information

[Figure labels: C A G T, C A G T, C A G T]

CCGAATCATGTC CGAATCATTTAGCCAAAAAT…

I(Query | Reference)
Mutual Information
I(Query)

Markov Expert

Align Expert

TAGTAACAGTTTCCGAATCAAGCCAAAAAT
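The idea of the two experts can be sketched as follows. This is a toy illustration under several assumptions and not the XMAligner implementation: the Markov expert is an adaptive order-2 model with pseudo-counts, the align expert copies the reference base at a fixed, known offset and trusts it with probability 0.9, and the two predictions are blended with equal weights. The choice of which slide sequence plays the query and which the reference is also an assumption.

```python
import math
from collections import defaultdict

BASES = "ACGT"  # sketch assumes sequences contain only these symbols


def encoding_bits(query, reference=None, offset=0, order=2, p_copy=0.9):
    """Bits needed to encode query; with a reference this estimates I(Query | Reference).

    order, p_copy, offset and the equal-weight blend are hypothetical choices.
    """
    counts = defaultdict(lambda: dict.fromkeys(BASES, 1))  # pseudo-count of 1 per base
    total = 0.0
    for i, base in enumerate(query):
        # Markov expert: predict from the preceding bases of the query itself.
        ctx = counts[query[max(0, i - order):i]]
        p = ctx[base] / sum(ctx.values())
        # Align expert: predict that the query copies the aligned reference base.
        j = i + offset
        if reference is not None and 0 <= j < len(reference):
            p_align = p_copy if reference[j] == base else (1 - p_copy) / 3
            p = 0.5 * p + 0.5 * p_align
        total += math.log2(1 / p)
        ctx[base] += 1  # adapt the Markov expert after encoding the base
    return total


reference = "TAGTAACAGTTTCCGAATCAAGCCAAAAAT"
query = "CCGAATCATGTC"  # matches the reference from position 12 onward, then diverges
i_query = encoding_bits(query)
i_query_given_ref = encoding_bits(query, reference, offset=12)
print(i_query, i_query_given_ref, i_query - i_query_given_ref)
```

The difference between the two totals plays the role of the mutual information on the slide: the more the reference helps to predict the query, the more bits are saved, and the stronger the evidence that the two regions are related.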

Page 16: Sequence Alignment by Information Compression

XMAligner wins on distantly related biased sequences

[Plot axes: Specificity, Sensitivity]
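For reference, the two axes can be computed from counts of correct and incorrect calls. The sketch below uses the standard statistical definitions; this is an assumption, since some sequence-analysis benchmarks use "specificity" for TP / (TP + FP) instead, and the slides do not say which definition applies. The function names and the counts are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true homologous positions that the aligner recovers."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-homologous positions that the aligner leaves unaligned."""
    return tn / (tn + fp)


# Hypothetical counts, for illustration only.
print(sensitivity(tp=90, fn=10), specificity(tn=80, fp=20))
```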

Page 17: Sequence Alignment by Information Compression

XMAligner is the most sensitive at detecting exons

Page 18: Sequence Alignment by Information Compression

XMAligner detecting a gene cluster

PLASMODIUM GENE CLUSTER

Page 19: Sequence Alignment by Information Compression

Alignment by compression overcomes the limitations of traditional alignment, producing better results in distantly related or biased sequences