multiple sequence alignment (i)

Multiple Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct. 4, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign


TRANSCRIPT

Page 1: Multiple Sequence Alignment (I)

Multiple Sequence Alignment (I)

(Lecture for CS498-CXZ Algorithms in Bioinformatics)

Oct. 4, 2005

ChengXiang Zhai

Department of Computer Science, University of Illinois, Urbana-Champaign

Page 2: Multiple Sequence Alignment (I)

Outline

• Motivation

• Scoring of multiple sequence alignments

• Algorithms

– Dynamic programming

– Progressive alignment (next class)

Page 3: Multiple Sequence Alignment (I)

Why Multiple Alignments?

• Characterize protein families: identify shared regions of homology in a multiple sequence alignment

• Determination of the consensus sequence of several aligned sequences.

• Help predict the secondary and tertiary structures of new sequences

• Help predict the function of new sequences

• Preliminary step in molecular evolution analysis using phylogenetic trees.

Page 4: Multiple Sequence Alignment (I)

Example of Multiple Alignment

Multiple sequence alignment of 7 neuroglobins using ClustalX (slide from Craig A. Struble)

Page 5: Multiple Sequence Alignment (I)

4 Basic Questions in Multiple Alignment

Input sequences:

$X_1 = x_{11}, \ldots, x_{1m_1}$

$X_2 = x_{21}, \ldots, x_{2m_2}$

…

$X_N = x_{N1}, \ldots, x_{Nm_N}$

Possible alignments of all $X_i$'s: $A = \{a_1, \ldots, a_k\}$

Model: scoring function $s: A \to \mathbb{R}$

Find the best alignment(s):

$$a^* = \arg\max_{a} \; s(a(X_1, X_2, \ldots, X_N))$$

Example output: $S(a^*) = 21$

Q1: How should we define s?

Q2: How should we define A?

Q3: How can we find a* quickly?

Q4: Is the alignment biologically meaningful?

Page 6: Multiple Sequence Alignment (I)

Defining Multi-Sequence Alignment

• We may generalize our definition of pairwise sequence alignment

• Alignment of 2 sequences is represented as a 2-row matrix

• In a similar way, we represent alignment of 3 sequences as a 3-row matrix

A T _ G C G _

A _ C G T _ A

A T C A C _ A

• A column must have at least one nucleotide

• Question: How many possible global alignments are there for 3 sequences each of length 2?
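One way to explore the question above is to count alignments by machine. The sketch below assumes that two alignments are different whenever their matrices differ, and that every column consumes a symbol from at least one sequence. Under those assumptions an alignment is a sequence of column "moves", and counting reduces to a memoized recursion. The function name `count_alignments` is illustrative and not part of the lecture.

```python
from functools import lru_cache
from itertools import product


def count_alignments(lengths):
    """Count global alignments of sequences with the given lengths.

    Assumption: every column holds at least one residue, so a column
    corresponds to a non-zero 0/1 vector saying which sequences advance.
    """
    # All possible column types: which sequences contribute a residue.
    moves = [m for m in product((0, 1), repeat=len(lengths)) if any(m)]

    @lru_cache(maxsize=None)
    def paths(pos):
        # pos[r] = number of residues of sequence r already placed.
        if not any(pos):
            return 1  # the empty alignment
        total = 0
        for move in moves:
            prev = tuple(p - step for p, step in zip(pos, move))
            if min(prev) >= 0:
                total += paths(prev)
        return total

    return paths(tuple(lengths))


if __name__ == "__main__":
    print(count_alignments((2, 2)))     # two sequences of length 2
    print(count_alignments((2, 2, 2)))  # three sequences of length 2
```

With these assumptions the count for three sequences of length 2 is 409, compared with 13 for two sequences of length 2, which already hints at how fast the search space grows.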

Page 7: Multiple Sequence Alignment (I)

How do we score a multiple alignment?

Page 8: Multiple Sequence Alignment (I)

Scoring a Multiple Alignment

• Ideally, it should be based on evolutionary models

• In practice,

– We often assume columns are independent

– Use "Sum of Pairs" (SP scores)

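Column independence (here $m_i$ is column $i$ of alignment $m$):

$$S(m) = G + \sum_i S(m_i)$$

Sum of pairs (here $m_i^k$ is the symbol in row $k$ of column $i$):

$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$

E.g., $s(-,-) = 0$, $s(a,-) = s(-,a) = -d$, $s(a,b) = \mathrm{BLOSUM}(a,b)$

G is the gap score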
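The sum-of-pairs score can be sketched in a few lines. This is a minimal illustration under stated assumptions: a simple match/mismatch score stands in for the BLOSUM entries, gaps are charged a linear penalty `d` per residue-gap pair, and the separate gap term G is left out. The names `pair_score`, `sp_column_score`, and `sp_alignment_score` are hypothetical.

```python
from itertools import combinations

GAP = "-"


def pair_score(a, b, match=1, mismatch=-1, d=2):
    """Pairwise score s(a, b); match/mismatch is a stand-in for BLOSUM."""
    if a == GAP and b == GAP:
        return 0          # s(-,-) = 0
    if a == GAP or b == GAP:
        return -d         # s(a,-) = s(-,a) = -d
    return match if a == b else mismatch


def sp_column_score(column):
    """Sum s over all unordered pairs of rows k < l in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment_score(rows):
    """Sum of column scores, assuming columns are independent."""
    return sum(sp_column_score(col) for col in zip(*rows))


if __name__ == "__main__":
    # The 3-row example matrix from the "Defining Multi-Sequence Alignment" slide.
    rows = ["AT-GCG-", "A-CGT-A", "ATCAC-A"]
    print(sp_alignment_score(rows))
```

Swapping `pair_score` for a lookup into a real substitution matrix would leave the other two functions unchanged.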

Page 9: Multiple Sequence Alignment (I)

Minimum Entropy Scoring

$$S(m_i) = -\sum_a p_{ia} \log p_{ia}, \qquad p_{ia} = \frac{c_{ia}}{\sum_{a'} c_{ia'}}$$

where $c_{ia}$ is the count of symbol $a$ in column $i$.

Intuition:

• A perfectly aligned column has one single symbol (least uncertainty)

• A poorly aligned column has many distinct symbols (high uncertainty)

This is related to the HMM formulation of the alignment problem, which we will cover later …

Page 10: Multiple Sequence Alignment (I)

Entropy: Example

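Best case: a column containing A, A, A, A

entropy $= 0$

Worst case: a column containing C, G, T, A

entropy $= -\sum \frac{1}{4} \log_2 \frac{1}{4} = -4\left(\frac{1}{4} \cdot (-2)\right) = 2$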

Page 11: Multiple Sequence Alignment (I)

Entropy of an Alignment: Example

Alignment (one sequence per row):

A A A

A C C

A C G

A C T

Column entropy: -(pA log pA + pC log pC + pG log pG + pT log pT), with log taken base 2 and 0*log(0) treated as 0

• Column 1 = -[1*log(1) + 0*log(0) + 0*log(0) + 0*log(0)] = 0

• Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log(0) + 0*log(0)] = -[(1/4)*(-2) + (3/4)*(-0.415)] = +0.811

• Column 3 = -[(1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4)] = 4 * -[(1/4)*(-2)] = +2

• Alignment entropy = 0 + 0.811 + 2 = +2.811
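The worked example can be checked with a short script. The sketch assumes base-2 logarithms and columns that contain only residues (gap handling is not specified on the slide, so it is left out). The helper names `column_entropy` and `alignment_entropy` are illustrative.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Entropy of one column: -sum_a p_a * log2(p_a).

    Symbols that do not occur contribute nothing, which matches the
    convention 0 * log(0) = 0.
    """
    counts = Counter(column)
    total = sum(counts.values())
    return sum(-(c / total) * log2(c / total) for c in counts.values())


def alignment_entropy(rows):
    """Sum of column entropies; rows are equal-length aligned strings."""
    return sum(column_entropy(col) for col in zip(*rows))


if __name__ == "__main__":
    rows = ["AAA", "ACC", "ACG", "ACT"]
    for i, col in enumerate(zip(*rows), start=1):
        print(f"column {i}: {column_entropy(col):.3f}")
    print(f"alignment entropy: {alignment_entropy(rows):.3f}")
```

Lower totals indicate better-conserved alignments, so a search procedure using this score would minimize it rather than maximize it.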

Page 12: Multiple Sequence Alignment (I)

How can we find a multiple alignment quickly?

Can we generalize the dynamic programming algorithm used for pairwise alignment?

Page 13: Multiple Sequence Alignment (I)

Alignments = Paths in…

• Align 3 sequences: ATGC, AATC, ATGC

A -- T G C

A A T -- C

-- A T G C

Page 14: Multiple Sequence Alignment (I)

Alignment Paths

x coordinate: 0 1 1 2 3 4

A -- T G C

A A T -- C

-- A T G C

Page 15: Multiple Sequence Alignment (I)

Alignment Paths

• Align the following 3 sequences: ATGC, AATC, ATGC

x coordinate: 0 1 1 2 3 4

A -- T G C

y coordinate: 0 1 2 3 3 4

A A T -- C

-- A T G C

Page 16: Multiple Sequence Alignment (I)

Alignment Paths

x coordinate: 0 1 1 2 3 4

A -- T G C

y coordinate: 0 1 2 3 3 4

A A T -- C

z coordinate: 0 0 1 2 3 4

-- A T G C

• Resulting path in (x,y,z) space:

(0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)
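The mapping from an alignment to its path can be sketched as follows: each coordinate counts how many residues of one sequence have been consumed so far, and each column advances the coordinates of the rows that hold a residue. The function name `alignment_to_path` is hypothetical, and the code writes a gap as a single "-" where the slide uses "--".

```python
def alignment_to_path(rows, gap="-"):
    """Convert aligned rows into the lattice path they trace.

    rows: equal-length strings, one per sequence, with `gap` marking indels.
    Returns a list of coordinate tuples, starting at the origin.
    """
    position = [0] * len(rows)
    path = [tuple(position)]
    for column in zip(*rows):
        for r, symbol in enumerate(column):
            if symbol != gap:
                position[r] += 1  # this sequence advances by one residue
        path.append(tuple(position))
    return path


if __name__ == "__main__":
    # Rows in (x, y, z) order: ATGC, AATC, ATGC.
    rows = ["A-TGC", "AAT-C", "-ATGC"]
    print(alignment_to_path(rows))
```

For the three rows above this yields (0,0,0), (1,1,0), (1,2,1), (2,3,2), (3,3,3), (4,4,4), the same path as on the slide.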

Page 17: Multiple Sequence Alignment (I)

2-D vs 3-D Alignment Grid

[Figure: a 2-D edit graph for sequences V and W, shown next to the open question of what its 3-D counterpart looks like]

Page 18: Multiple Sequence Alignment (I)

Architecture of 3-D Alignment Grid

In 3-D, 7 edges in each unit cube

In 2-D, 3 edges in each unit square

Page 19: Multiple Sequence Alignment (I)

A Cell of 3-D Alignment Grid

[Figure: a unit cube of the 3-D grid with its eight corners labeled (i-1,j-1,k-1), (i,j-1,k-1), (i-1,j,k-1), (i-1,j-1,k), (i,j,k-1), (i,j-1,k), (i-1,j,k), and (i,j,k)]

Page 20: Multiple Sequence Alignment (I)

Multiple Alignment: Dynamic Programming

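$$
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge diagonal: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge diagonal: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge diagonal: two indels}
\end{cases}
$$

• δ(x, y, z) is an entry in the 3-D scoring matrix and can be computed using sum of pairs or entropy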
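The recurrence can be sketched directly as a triple loop over the grid. This is a minimal illustration, not a tuned implementation: δ is assumed to be a sum-of-pairs score with match = 1, mismatch = -1, and gap = -1 per residue-gap pair, and only the optimal score is returned (no traceback). The names `delta` and `msa3_score` are hypothetical; `v`, `w`, `u` follow the slide.

```python
from itertools import product

GAP = "-"
# The 7 predecessor offsets: every non-zero 0/1 vector (di, dj, dk).
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def delta(x, y, z):
    """Column score delta(x, y, z), here assumed to be sum of pairs."""
    return pair_score(x, y) + pair_score(x, z) + pair_score(y, z)


def msa3_score(v, w, u):
    """Optimal 3-way global alignment score via the 3-D recurrence."""
    n, m, p = len(v), len(w), len(u)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (p + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    s[0][0][0] = 0
    # Lexicographic order guarantees all predecessors are filled first.
    for i, j, k in product(range(n + 1), range(m + 1), range(p + 1)):
        if i == j == k == 0:
            continue
        best = neg_inf
        for di, dj, dk in MOVES:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue  # predecessor lies outside the grid
            column = (
                v[i - 1] if di else GAP,
                w[j - 1] if dj else GAP,
                u[k - 1] if dk else GAP,
            )
            best = max(best, s[pi][pj][pk] + delta(*column))
        s[i][j][k] = best
    return s[n][m][p]


if __name__ == "__main__":
    print(msa3_score("ATGC", "AATC", "ATGC"))
```

Each cell inspects up to 7 predecessors, one for each non-zero choice of which sequences advance, which is where the constant 7 in the running-time analysis comes from.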

Page 21: Multiple Sequence Alignment (I)

Multiple Alignment: Running Time

• For 3 sequences of length n, the run time is $7n^3$, i.e. $O(n^3)$

• For k sequences, building a k-dimensional edit graph has run time $(2^k - 1)\,n^k$, i.e. $O(2^k n^k)$

• Conclusion: dynamic programming approach for alignment between two sequences is easily extended to k sequences but it is impractical due to exponential running time
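To get a feel for these numbers, the sketch below evaluates the operation count (2^k - 1)·n^k for a few values of k. It counts predecessor look-ups only and ignores the cost of scoring each column. The function name `dp_cell_updates` and the choice n = 100 are illustrative assumptions.

```python
def dp_cell_updates(k, n):
    """Predecessor look-ups for k sequences of length n: (2^k - 1) * n^k."""
    return (2 ** k - 1) * n ** k


if __name__ == "__main__":
    n = 100  # hypothetical sequence length
    for k in range(2, 7):
        print(f"k={k}: {dp_cell_updates(k, n):.3e} updates")
```

Each additional sequence multiplies the work by roughly 2n, so even modest values of k put exact dynamic programming out of reach for sequences of realistic length.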

Page 22: Multiple Sequence Alignment (I)

In the next class, we will cover more efficient algorithms: progressive alignment.

Page 23: Multiple Sequence Alignment (I)

What You Should Know

• How to score a multi-sequence alignment

• How the dynamic programming algorithm works

• Computational complexity of dynamic programming algorithms