Pairwise Sequence Alignment (I)

(Lecture for CS498-CXZ Algorithms in Bioinformatics)

Sept. 22, 2005

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm

Comparing Genes in Two Genomes

• Small islands of similarity corresponding to similarities between exons

• Such comparisons are quite common in biology research

Alignment of sequences is one of the most basic and most important problems in bioinformatics…

Outline

• Defining the problem of alignment

• The longest common subsequence problem

• Dynamic programming algorithms for alignment

Aligning Two Strings

Given the strings:

• v = ATGTTAT

• w = ATCGTAC

One possible alignment of the strings:

AT_GTTAT_

ATCGT_A_C

1st row – string v with space symbols “_” inserted

2nd row – string w with space symbols “_” inserted
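
The definition above can be checked mechanically. The following is a minimal sketch in Python, not part of the original slides: the helper name `is_alignment` is hypothetical, and the rule that no column holds two space symbols is the usual convention, assumed here rather than stated on the slide.

```python
GAP = "_"  # space symbol, written "_" as in the alignment shown above


def is_alignment(row_v, row_w, v, w):
    """Return True if (row_v, row_w) is a valid alignment of v and w."""
    same_length = len(row_v) == len(row_w)
    # Removing the space symbols must give back the original strings.
    spells_v = row_v.replace(GAP, "") == v
    spells_w = row_w.replace(GAP, "") == w
    # Assumed convention: no column consists of two space symbols.
    no_empty_column = all(not (a == GAP and b == GAP)
                          for a, b in zip(row_v, row_w))
    return same_length and spells_v and spells_w and no_empty_column


print(is_alignment("AT_GTTAT_", "ATCGT_A_C", "ATGTTAT", "ATCGTAC"))  # True
```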

Aligning Two Strings (cont’d)

Another way to represent each row is to record how many symbols of the sequence have appeared up to each position. For example, the alignment above can be represented as:

v: AT_GTTAT_ → 0 1 2 2 3 4 5 6 7 7

w: ATCGT_A_C → 0 1 2 3 4 5 5 6 6 7
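
A short Python sketch of this counting representation follows. The function name `prefix_counts` is illustrative only; it reproduces the two number rows for the example alignment.

```python
def prefix_counts(row, gap="_"):
    """Number of real symbols (non-gaps) seen before and after each column."""
    counts = [0]
    for symbol in row:
        counts.append(counts[-1] + (1 if symbol != gap else 0))
    return counts


print(prefix_counts("AT_GTTAT_"))  # [0, 1, 2, 2, 3, 4, 5, 6, 7, 7]
print(prefix_counts("ATCGT_A_C"))  # [0, 1, 2, 3, 4, 5, 5, 6, 6, 7]
```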

Alignment Matrix

Both rows of the alignment can be represented in the resulting matrix:

0 1 2 2 3 4 5 6 7 7   (row for v: AT_GTTAT_)

0 1 2 3 4 5 5 6 6 7   (row for w: ATCGT_A_C)

Alignment as a Path in the Edit Graph

[Figure: edit graph for the alignment AT_GTTAT_ / ATCGT_A_C, with the first column (A aligned with A) highlighted]

Path so far: (0,0), (1,1)

Alignment as a Path in the Edit Graph

[Figure: edit graph for the alignment AT_GTTAT_ / ATCGT_A_C, with the first two columns highlighted]

Path so far: (0,0), (1,1), (2,2)

Alignment as a Path in the Edit Graph

[Figure: edit graph for the alignment AT_GTTAT_ / ATCGT_A_C, with the first four columns highlighted]

Path so far: (0,0), (1,1), (2,2), (2,3), (3,4)

Alignment as a Path in the Edit Graph

0 1 2 2 3 4 5 6 7 7

A T _ G T T A T _

A T C G T _ A _ C

0 1 2 3 4 5 5 6 6 7

(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

- End Result -

Alignment as a Path in the Edit Graph

Every path in the edit graph corresponds to an alignment:
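
To make the correspondence concrete, here is a small Python sketch (the function name `path_to_alignment` is hypothetical) that turns a path from (0,0) to (|v|,|w|) back into the two rows of an alignment. A diagonal step aligns two symbols; a step that advances only one coordinate inserts a space symbol in the other row.

```python
def path_to_alignment(path, v, w, gap="_"):
    """Convert a list of (i, j) vertices into the two alignment rows."""
    row_v, row_w = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):      # diagonal edge: v[i1-1] aligned with w[j1-1]
            row_v.append(v[i1 - 1])
            row_w.append(w[j1 - 1])
        elif step == (1, 0):    # uses a symbol of v only
            row_v.append(v[i1 - 1])
            row_w.append(gap)
        elif step == (0, 1):    # uses a symbol of w only
            row_v.append(gap)
            row_w.append(w[j1 - 1])
        else:
            raise ValueError(f"not an edit-graph edge: {step}")
    return "".join(row_v), "".join(row_w)


path = [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4),
        (4, 5), (5, 5), (6, 6), (7, 6), (7, 7)]
print(path_to_alignment(path, "ATGTTAT", "ATCGTAC"))
# ('AT_GTTAT_', 'ATCGT_A_C')
```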

How to Score an Alignment?

• Simplest

– Every match scores 1

– Every mismatch scores 0

– An alignment is scored based on the number of common symbols

– Leads to the longest common subsequence problem

• More sophisticated

– ?

– ?

– To be covered later

Alignments in Edit Graph (cont’d)

Horizontal and vertical edges represent indels in v and w

• Score 0.

Diagonal edges represent exact matches

• Score 1.

Alignments in Edit Graph (cont’d)

The score of the alignment path in the graph is 5.
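
The score of 5 can be reproduced with a few lines of Python. This is only a sketch of the simplest scoring scheme (match = 1, everything else = 0); the name `alignment_score` is illustrative.

```python
def alignment_score(row_v, row_w, gap="_"):
    """Count the columns where both rows hold the same real symbol."""
    return sum(1 for a, b in zip(row_v, row_w) if a == b and a != gap)


print(alignment_score("AT_GTTAT_", "ATCGT_A_C"))  # 5
```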

The Longest Common Subsequence (LCS) Problem

• Find the longest subsequence common to two strings.

Input: Two strings, v and w.

Output: The longest common subsequence of v and w.

A subsequence is not necessarily consecutive

v = ATGTTAT

w = ATCGTAC

Longest common subsequence: “ATGTA”

Best alignment:

v = AT_GTTAT
    || || |
w = ATCGT_AC
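
Because a subsequence need not be consecutive, checking one takes a left-to-right scan rather than a substring search. A minimal Python sketch (the helper name `is_subsequence` is hypothetical):

```python
def is_subsequence(s, text):
    """True if the symbols of s appear in text in order, gaps allowed."""
    remaining = iter(text)
    # "ch in remaining" advances the iterator past the first match,
    # so later symbols of s can only match later positions of text.
    return all(ch in remaining for ch in s)


print(is_subsequence("ATGTA", "ATGTTAT"))  # True
print(is_subsequence("ATGTA", "ATCGTAC"))  # True
```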

How to solve the LCS problem efficiently?

Brute Force Approach

• Enumerate all the sequences up to length min(|v|,|w|)

• For each one, check to see if it is a subsequence of v and w

• Very expensive… (How many sequences do we have to enumerate?)
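
The brute-force idea can be sketched as follows in Python. The four-letter DNA alphabet and the function names are assumptions made for illustration. With an alphabet of k symbols there are k^0 + k^1 + … + k^L candidate sequences of length at most L, which grows exponentially in L.

```python
from itertools import product


def is_subsequence(s, text):
    remaining = iter(text)
    return all(ch in remaining for ch in s)


def brute_force_lcs(v, w, alphabet="ACGT"):
    """Try every sequence over the alphabet, longest first."""
    for length in range(min(len(v), len(w)), -1, -1):
        for candidate in product(alphabet, repeat=length):
            if is_subsequence(candidate, v) and is_subsequence(candidate, w):
                return "".join(candidate)


def candidates_enumerated(k, max_len):
    """Number of sequences of length 0..max_len over k symbols."""
    return sum(k ** length for length in range(max_len + 1))


print(brute_force_lcs("ATGTTAT", "ATCGTAC"))  # ATGTA
print(candidates_enumerated(4, 7))            # 21845
```

Even for two strings of length 7 the search space is already in the tens of thousands; for realistic sequence lengths it is out of reach.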

The Idea of Dynamic Programming

• Think of an alignment as a path in an edit graph

• We only need to keep track of the best alignment (i.e., the longest common subsequence)

• Score a longer alignment based on shorter alignments

Alignment as a Path in the Edit Graph

0 1 2 2 3 4 5 6 7 7

v = A T _ G T T A T _

w = A T C G T _ A _ C

0 1 2 3 4 5 5 6 6 7

(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

Use each cell to store the best alignment so far…

Alignment: Dynamic Programming

Use this scoring algorithm

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\
s_{i-1,j} \\
s_{i,j-1}
\end{cases}
$$

Dynamic Programming Example

• There are no matches before the first symbol of either sequence

• Label the first column (j = 0) and the first row (i = 0) with all zeros

Dynamic Programming Example

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{value from NW, plus 1, if } v_i = w_j \\
s_{i-1,j} & \text{value from North (top)} \\
s_{i,j-1} & \text{value from West (left)}
\end{cases}
$$

Keep track of the best alignment score and the path contributing to it

Alignment: Backtracking

Arrows show where each score came from:

↑ if from the top

← if from the left

↖ if vi = wj

Dynamic Programming Example

Continuing with the scoring algorithm gives this result.

LCS Algorithm

LCS(v, w)   (n = |v|, m = |w|)
  for i ← 0 to n
    s_{i,0} ← 0
  for j ← 1 to m
    s_{0,j} ← 0
  for i ← 1 to n
    for j ← 1 to m
      s_{i,j} ← max { s_{i-1,j},  s_{i,j-1},  s_{i-1,j-1} + 1 if v_i = w_j }
      b_{i,j} ← “↑” if s_{i,j} = s_{i-1,j}
                “←” if s_{i,j} = s_{i,j-1}
                “↖” if s_{i,j} = s_{i-1,j-1} + 1
  return (s_{n,m}, b)
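
One possible Python rendering of this pseudocode is sketched below. It is an illustration, not the course's reference code: the function name `lcs_table` and the arrow labels "up", "left" and "diag" are choices made here, and ties between arrows are broken in the order top, left, diagonal.

```python
def lcs_table(v, w):
    """Return (s, b): the score table and the backtracking arrows."""
    n, m = len(v), len(w)
    # Row 0 and column 0 stay at zero: nothing has been matched yet.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best, arrow = s[i - 1][j], "up"            # value from the top
            if s[i][j - 1] > best:
                best, arrow = s[i][j - 1], "left"      # value from the left
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 > best:
                best, arrow = s[i - 1][j - 1] + 1, "diag"  # match
            s[i][j], b[i][j] = best, arrow
    return s, b


s, b = lcs_table("ATGTTAT", "ATCGTAC")
print(s[-1][-1])  # 5
```

Python strings are indexed from 0, so the symbol v_i of the slides is `v[i - 1]` in the code.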

Now What?

• LCS(v,w) created the alignment grid

• Now we need a way to read the best alignment of v and w

• Follow the arrows backwards from the (|v|,|w|) cell
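
The backtracking step might look like the following Python sketch. The table-filling helper is repeated so the example runs on its own; all names are hypothetical, and the alignment returned depends on how ties between arrows are broken, so other equally good alignments may exist.

```python
def lcs_arrows(v, w):
    """Fill the score table and return the arrow table b."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j], b[i][j] = s[i - 1][j], "up"
            if s[i][j - 1] > s[i][j]:
                s[i][j], b[i][j] = s[i][j - 1], "left"
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 > s[i][j]:
                s[i][j], b[i][j] = s[i - 1][j - 1] + 1, "diag"
    return b


def read_alignment(v, w, b, gap="_"):
    """Walk the arrows back from cell (|v|, |w|) to cell (0, 0)."""
    i, j = len(v), len(w)
    row_v, row_w, common = [], [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and b[i][j] == "diag":
            row_v.append(v[i - 1]); row_w.append(w[j - 1])
            common.append(v[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or b[i][j] == "up"):
            row_v.append(v[i - 1]); row_w.append(gap)
            i -= 1
        else:
            row_v.append(gap); row_w.append(w[j - 1])
            j -= 1
    # The rows were collected back to front, so reverse them.
    return ("".join(reversed(row_v)), "".join(reversed(row_w)),
            "".join(reversed(common)))


v, w = "ATGTTAT", "ATCGTAC"
row_v, row_w, common = read_alignment(v, w, lcs_arrows(v, w))
print(row_v)
print(row_w)
print(common)  # ATGTA
```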

LCS Runtime

• Filling the (n+1) × (m+1) matrix of best scores from vertex (0,0) to all other vertices takes O(nm) time.

• Why O(nm)? The pseudocode nests one “for” loop (over j) inside another (over i), and each cell is computed in constant time from its three neighbors.

How do we improve the scoring of alignments?

Can we still find an alignment efficiently?

We’ll talk about these later…

The LCS Recurrence Revisited

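• The formula can be rewritten by adding zero to the edges that come from an indel, since the penalty for an indel is 0:

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \quad \text{(matching score)} \\
s_{i-1,j} + 0 & \text{(insertion/deletion score)} \\
s_{i,j-1} + 0 & \text{(insertion/deletion score)}
\end{cases}
$$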
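
Writing the scores out explicitly suggests turning them into parameters. The Python sketch below does this under stated assumptions: the parameter names `match` and `indel` are made up for illustration, the border cells accumulate the indel score (which is all zeros for the LCS case), and mismatches are still not allowed on the diagonal. With the defaults it computes the LCS length.

```python
def best_score(v, w, match=1, indel=0):
    """Best path score in the edit graph with explicit edge scores."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + indel
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [s[i - 1][j] + indel,   # insertion/deletion score
                          s[i][j - 1] + indel]   # insertion/deletion score
            if v[i - 1] == w[j - 1]:
                candidates.append(s[i - 1][j - 1] + match)  # matching score
            s[i][j] = max(candidates)
    return s[n][m]


print(best_score("ATGTTAT", "ATCGTAC"))  # 5, the LCS length
```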

What You Should Know

• How an alignment corresponds to a path in an edit graph

• How the LCS problem corresponds to alignment with a simple scoring method

• How the dynamic programming algorithm solves the LCS problem (= simple alignment)