Aligning Alignments Exactly

By John Kececioglu and Dean Starrett, CS Dept., Univ. of Arizona. Appeared in 8th ACM RECOMB 2004. Presented by Jie Meng.



TRANSCRIPT

Page 1: Aligning Alignments Exactly

Aligning Alignments Exactly

By John Kececioglu, Dean Starrett
CS Dept., Univ. of Arizona

Appeared in 8th ACM RECOMB 2004

Presented by Jie Meng

Page 2: Aligning Alignments Exactly

• Background
• Definition
• Hardness
• An exponential-time algorithm

Page 3: Aligning Alignments Exactly

Alignments

Given two (DNA or protein) sequences, an alignment places one above the other so that the similar parts line up as closely as possible, for example:

A T - C - T C G C T
- T G - A T G - A T

There are four kinds of aligned columns:

Match

Insertion

Deletion

Mismatch

Page 4: Aligning Alignments Exactly

Scoring Alignments

There are four types of aligned columns:
- Match: Score_match = 0.
- Mismatch: Score_mismatch ≥ 0.
- Insertion: Score_insertion ≥ 0.
- Deletion: Score_deletion ≥ 0.

The score of an alignment is defined to be the sum of the scores of its aligned columns.

The goal is to minimize the score.

Page 5: Aligning Alignments Exactly

Gap-cost

We can refine the indel score into a gap-open cost and a gap-extension cost: a gap of size x then costs open + x * extension instead of x * indel.

AT----CGCTTCAT
-TGCAT--AT-----

open + 4 * extension
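To make the difference between the two gap models concrete, here is a minimal Python sketch. The helper names (`gap_lengths`, `affine_gap_cost`, `linear_gap_cost`) and the cost values are illustrative assumptions, not taken from the slides. It counts the gaps of a single row, which matches the pairwise case where no column holds two spaces.

```python
import re


def gap_lengths(row, space="-"):
    """Lengths of the maximal runs of spaces in one alignment row."""
    return [len(m.group()) for m in re.finditer(re.escape(space) + "+", row)]


def affine_gap_cost(row, gap_open, gap_ext):
    """Each gap of size x costs gap_open + x * gap_ext."""
    return sum(gap_open + x * gap_ext for x in gap_lengths(row))


def linear_gap_cost(row, indel):
    """Each gap of size x costs x * indel (no opening penalty)."""
    return sum(x * indel for x in gap_lengths(row))


row = "AT----CGCTTCAT"          # one gap of size 4
print(gap_lengths(row))          # run lengths of the gaps in this row
print(affine_gap_cost(row, gap_open=2, gap_ext=1))  # assumed costs: open + 4 * extension
print(linear_gap_cost(row, indel=1))                # assumed cost: 4 * indel
```

With an opening penalty, one long gap is cheaper than several short gaps of the same total length, which is the behavior the open/extension split is meant to capture.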

Page 6: Aligning Alignments Exactly

Multiple Alignments

In general we also need to compare multiple sequences and find their similarities.

Multiple alignment generalizes the alignment idea to handle many sequences.

AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT

Page 7: Aligning Alignments Exactly

Sum-of-Pairs (SP) Score

Given a multiple alignment, the sum-of-pairs (SP) score is the sum of the scores of the induced pairwise alignments of each pair of rows in the alignment.

AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT

AT-C-TCGAT       -TGCAT--AT       AT-C-TCGAT
-TGCAT--AT   +   ATCCA-CGCT   +   ATCCA-CGCT
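The SP score can be sketched in a few lines of Python. This is only an illustration under assumed costs (mismatch = 1, indel = 1, no gap-open cost); the function names are hypothetical. Columns where both rows hold a space are dropped when a pair is induced.

```python
from itertools import combinations


def induced(row_a, row_b, space="-"):
    """Induced pairwise alignment: drop columns where both rows hold a space."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if not (a == space and b == space)]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)


def column_cost(a, b, mismatch=1, indel=1, space="-"):
    """Cost of one aligned column of a pairwise alignment (assumed unit costs)."""
    if a == space or b == space:
        return indel
    return 0 if a == b else mismatch


def sp_score(rows, mismatch=1, indel=1):
    """Sum of the induced pairwise alignment costs over all pairs of rows."""
    total = 0
    for r1, r2 in combinations(rows, 2):
        a, b = induced(r1, r2)
        total += sum(column_cost(x, y, mismatch, indel) for x, y in zip(a, b))
    return total


rows = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
print(sp_score(rows))
```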

Page 8: Aligning Alignments Exactly

BAD NEWS

Multiple alignment is NP-hard.

One approach is to approximate the optimal value.

Progressive alignments

A problem arises naturally: Aligning Alignments

Page 9: Aligning Alignments Exactly

Aligning Alignments

Let S be a collection of strings s1, s2, s3, ..., sk over alphabet Σ.

An alignment of S is a matrix A with k rows such that:
i) Each entry is either a letter or a space;
ii) No column is all spaces;
iii) Reading across row i and removing the spaces gives string si.

As before, each aligned pair of entries is scored as a match, a mismatch (substitution), or an indel.
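The three conditions in the definition can be checked mechanically. The following Python sketch is illustrative only; the function name `is_alignment` and the use of "-" for a space are assumptions.

```python
def is_alignment(matrix, strings, space="-"):
    """Check whether `matrix` (a list of equal-length rows) is an alignment of `strings`."""
    if len(matrix) != len(strings):
        return False
    width = len(matrix[0]) if matrix else 0
    if any(len(row) != width for row in matrix):
        return False  # a matrix needs rows of equal length
    # Condition (ii): no column consists only of spaces.
    for col in zip(*matrix):
        if all(ch == space for ch in col):
            return False
    # Conditions (i) and (iii): removing the spaces from row i must give s_i,
    # so every non-space entry is a letter of s_i.
    return all(row.replace(space, "") == s for row, s in zip(matrix, strings))


print(is_alignment(["AT-C", "-TGC"], ["ATC", "TGC"]))
print(is_alignment(["A-", "A-"], ["A", "A"]))  # second column is all spaces
```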

Page 10: Aligning Alignments Exactly

Aligning Alignments

Given two alignments, A with k sequences of length N (N columns) and B with l sequences of length M (M columns), we want to align the columns of A with the columns of B.

AT-C-TCGAT
-TGCAT--AT
ATCCA-CGAT

CT-ATTGGAT
-TTAT-G--T
CTTA-GGGAT

Page 11: Aligning Alignments Exactly

Aligning Alignments

In other words, we treat the columns of A and B as single letters, just as when aligning two sequences.

CTGT-T

AT-TGT

C-TG-T--T

-AT--T-GT

Page 12: Aligning Alignments Exactly

Aligning Alignments

The score function is still sum-of-pairs, namely

$$\sum_{1 \le i \le k} \; \sum_{1 \le j \le l} D(A'_i, B'_j)$$

The induced alignment of Ai' and Bj' may contain columns with a space in both rows; we simply remove those columns before scoring. For example:

Ai': a----aa-a

Bj': aaa-a-a-a
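As a rough illustration of how one pair of rows is scored, the Python sketch below removes the columns that hold a space in both rows and then charges open + x * extension per gap. The function names and the cost values (mismatch 1, open 2, extension 1) are assumptions chosen for the example.

```python
def strip_double_spaces(row_a, row_b, space="-"):
    """Remove the columns in which both rows hold a space."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if (a, b) != (space, space)]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)


def D(row_a, row_b, mismatch=1, gap_open=2, gap_ext=1, space="-"):
    """Cost of the induced pairwise alignment with open/extension gap costs."""
    a_row, b_row = strip_double_spaces(row_a, row_b, space)
    cost, prev = 0, None  # prev: which row held the space in the previous column
    for a, b in zip(a_row, b_row):
        if a == space or b == space:
            side = "A" if a == space else "B"
            if side != prev:
                cost += gap_open  # a new gap starts in this column
            cost += gap_ext
            prev = side
        else:
            cost += 0 if a == b else mismatch
            prev = None
    return cost


def cross_score(rows_a, rows_b, **costs):
    """Sum of D over every row of A paired with every row of B."""
    return sum(D(ra, rb, **costs) for ra in rows_a for rb in rows_b)


print(strip_double_spaces("a----aa-a", "aaa-a-a-a"))
print(D("a----aa-a", "aaa-a-a-a"))
```

Note that after the double-space columns are removed, the first row's gap of four spaces shrinks to three, so the gap structure of an induced pair depends on the other rows of the combined alignment.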

Page 13: Aligning Alignments Exactly

Aligning Alignments

Without gap costs, aligning alignments is solvable in polynomial time. We can apply dynamic programming just as in aligning two sequences; the only difference is that we align columns.
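A minimal Python sketch of this column-based dynamic program follows. It assumes a per-space indel cost with no gap-open cost (mismatch = 1, indel = 1 are illustrative values), and it minimizes only the cost between rows of A and rows of B, since the costs within A and within B are fixed under this model. The function names are hypothetical.

```python
def pair_cost(x, y, mismatch=1, indel=1, space="-"):
    """Cost of one entry of A against one entry of B (assumed unit costs)."""
    if x == space and y == space:
        return 0
    if x == space or y == space:
        return indel
    return 0 if x == y else mismatch


def column_cost(col_a, col_b):
    """Sum-of-pairs cost of placing column col_a of A against column col_b of B."""
    return sum(pair_cost(x, y) for x in col_a for y in col_b)


def align_alignments_linear(rows_a, rows_b):
    """Minimum cross cost of aligning the columns of A with the columns of B."""
    cols_a, cols_b = list(zip(*rows_a)), list(zip(*rows_b))
    gap_a = ("-",) * len(rows_a)  # an all-space column inserted into A
    gap_b = ("-",) * len(rows_b)  # an all-space column inserted into B
    n, m = len(cols_a), len(cols_b)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][0] = T[i - 1][0] + column_cost(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        T[0][j] = T[0][j - 1] + column_cost(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = min(
                T[i - 1][j - 1] + column_cost(cols_a[i - 1], cols_b[j - 1]),
                T[i - 1][j] + column_cost(cols_a[i - 1], gap_b),
                T[i][j - 1] + column_cost(gap_a, cols_b[j - 1]),
            )
    return T[n][m]


print(align_alignments_linear(["A-", "AC"], ["AC"]))
```

The table has (N + 1)(M + 1) cells and each cell compares k * l pairs of entries, so the sketch runs in O(N M k l) time.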

Page 14: Aligning Alignments Exactly

Aligning Alignments

With gap costs, this problem is NP-complete. We can use a reduction from the MAX-CUT problem.

MAX-CUT: Given a graph G = (V, E) and an integer c, ask whether there is a partition of V, with V = L ∪ R and L ∩ R = ∅, such that the size of the cut is no less than c.

The cut is the set of edges that have one end vertex in L and the other in R.
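For small graphs the MAX-CUT question can be answered by brute force, which makes the definition easy to check by hand. The Python sketch below is only illustrative (exponential in the number of vertices), and its function names are assumptions.

```python
from itertools import product


def cut_size(edges, left):
    """Number of edges with exactly one end vertex in `left`."""
    return sum((u in left) != (v in left) for u, v in edges)


def has_cut_of_size(vertices, edges, c):
    """Decide MAX-CUT by trying every partition of the vertices into L and R."""
    for mask in product([False, True], repeat=len(vertices)):
        left = {v for v, bit in zip(vertices, mask) if bit}
        if cut_size(edges, left) >= c:
            return True
    return False


triangle = [(1, 2), (2, 3), (1, 3)]
print(has_cut_of_size([1, 2, 3], triangle, 2))
print(has_cut_of_size([1, 2, 3], triangle, 3))
```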

Page 15: Aligning Alignments Exactly

NP-hardness

• Given an instance of MAX-CUT, G = (V, E) with V = {v1, v2, ..., vn} and E = {e1, e2, ..., em}, and an integer c;

• we construct two multiple alignments A and B over the alphabet {0, 1}: both A and B have m edge rows and k dummy rows, with each edge row corresponding to an edge; A has 2n columns, every two consecutive columns corresponding to a vertex; B has 3n columns, every three consecutive columns corresponding to a vertex.

Page 16: Aligning Alignments Exactly

NP-hardness

• The dummy rows in A are (0-)^n; the dummy rows in B are (0--)^n.

• Edge rows in A: in the row for edge e = (vi, vj), the column groups for vi and vj each hold the substring "-1", and every other entry is a space.

• Edge rows in B: in the row for edge e = (vi, vj) with i < j, the column group for vi holds the substring "010" and the column group for vj holds the substring "-10".
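The construction above can be written out as a short Python sketch. This is a literal reading of the slide text, not the authors' code: the function name is hypothetical, the number of dummy rows k is left as a free parameter, and filling the remaining entries of the B edge rows with spaces is an assumption.

```python
def build_instance(n, edges, k):
    """Rows of A and B for a graph on vertices 1..n with k dummy rows each."""
    rows_a, rows_b = [], []
    for u, v in edges:
        i, j = min(u, v), max(u, v)
        # A: two columns per vertex; "-1" at both end vertices, spaces elsewhere.
        rows_a.append("".join("-1" if t in (i, j) else "--" for t in range(1, n + 1)))
        # B: three columns per vertex; "010" at v_i, "-10" at v_j.
        # Spaces in the other column groups are an assumption.
        rows_b.append("".join(
            "010" if t == i else "-10" if t == j else "---"
            for t in range(1, n + 1)
        ))
    rows_a += ["0-" * n] * k   # dummy rows (0-)^n
    rows_b += ["0--" * n] * k  # dummy rows (0--)^n
    return rows_a, rows_b


rows_a, rows_b = build_instance(3, [(1, 3)], k=1)
for row in rows_a:
    print(row)
for row in rows_b:
    print(row)
```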

Page 17: Aligning Alignments Exactly

NP-hardness

• We simply set the score of a match to 0,

the score of a mismatch to 1,

the gap-open cost to 2, and the gap-extension cost to 1,

and ask whether there is an alignment whose score is less than d - c.

This gives an instance of Aligning Alignments.

Page 18: Aligning Alignments Exactly

HOMEWORK 4

• Given a set of multiple alignments {A1, A2, ..., An}, where each Ai is a multiple alignment with ki sequences, consider the problem of multiply aligning {A1, A2, ..., An} without gap costs, using the method in this paper, i.e. aligning columns. Is this problem hard or easy? If hard, prove it; otherwise, give an efficient algorithm and prove its complexity and correctness.

Page 19: Aligning Alignments Exactly

Exact Algorithm

The basic idea is still dynamic programming.

We have to remember extra information in a set of so-called shapes, S: for each row of a multiple alignment, a shape records the column of the right-most letter.
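Read literally, a shape can be computed from a partial alignment as follows. This Python sketch follows only the wording of the slide; the function name, the 1-based column numbering, and the use of 0 for a row without letters are assumptions, and the paper's formal definition may differ.

```python
def shape(rows, space="-"):
    """For each row, the 1-based column of its right-most letter (0 if it has none)."""
    result = []
    for row in rows:
        last = 0
        for col, ch in enumerate(row, start=1):
            if ch != space:
                last = col
        result.append(last)
    return tuple(result)


print(shape(["AT-C-", "-TG--", "-----"]))
```

Knowing where each row's last letter sits is what lets the dynamic program tell whether appending a new column continues an existing gap or opens a new one.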

Page 20: Aligning Alignments Exactly

Exact Algorithm

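$$
S(i, j) =
\begin{cases}
\{\} & \text{if } i < 0 \text{ or } j < 0, \\
\{\emptyset\} & \text{if } i = 0 \text{ and } j = 0, \\
S(i-1, j-1) \circ (A[i], B[j]) \;\cup\; S(i, j-1) \circ (-, B[j]) \;\cup\; S(i-1, j) \circ (A[i], -) & \text{otherwise,}
\end{cases}
$$

where S ∘ c denotes the set of shapes obtained by appending the column pair c to alignments whose shapes are in S.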

Page 21: Aligning Alignments Exactly

Exact Algorithm

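$$
C(i, j, t) = \min
\begin{cases}
\min\limits_{s \in S(i-1, j-1) \,\&\, s \circ (A[i], B[j]) = t} \Big\{ C(i-1, j-1, s) + g(A[i], B[j], s) \cdot \mathrm{open} + \sum\limits_{p \in A,\, q \in B} D(A_p[i], B_q[j]) \Big\} \\[1ex]
\min\limits_{s \in S(i, j-1) \,\&\, s \circ (-, B[j]) = t} \Big\{ C(i, j-1, s) + g(-, B[j], s) \cdot \mathrm{open} + k \cdot |B[j]| \cdot \mathrm{extension} \Big\} \\[1ex]
\min\limits_{s \in S(i-1, j) \,\&\, s \circ (A[i], -) = t} \Big\{ C(i-1, j, s) + g(A[i], -, s) \cdot \mathrm{open} + l \cdot |A[i]| \cdot \mathrm{extension} \Big\}
\end{cases}
$$

Here g(A[i], B[j], s) is the total number of gaps initiated by appending columns A[i] and B[j] onto an alignment that ends in shape s, and s ∘ c = t means that appending the column pair c to an alignment of shape s gives shape t.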

Page 22: Aligning Alignments Exactly

Exact Algorithm

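The optimum value is

$$\min_{s \in S(m, n)} \{ C(m, n, s) \}$$

where (m, n) is the final cell of the dynamic program, reached once all columns of A and B have been consumed.

The problem is that the number of shapes may be very large, so in the worst case the time and space complexity is exponential.

[Worst-case time and space bounds, stated in terms of n and k: not recoverable from the transcript.]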

Page 23: Aligning Alignments Exactly

Any Questions?

423B
