multiple sequence alignments

Multiple Sequence Alignments Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University


Post on 05-Jan-2016


Page 1: Multiple Sequence Alignments

Multiple Sequence Alignments

Craig A. Struble, Ph.D.
Department of Mathematics, Statistics, and Computer Science
Marquette University

Page 2: Multiple Sequence Alignments

MSCS 230: Bioinformatics I - Multiple Sequence Alignment


Overview

Page 3: Multiple Sequence Alignments


Example

Multiple sequence alignment of 7 neuroglobins using ClustalX

Page 4: Multiple Sequence Alignments


Example

• Searching for domains with RPS-BLAST

Page 5: Multiple Sequence Alignments


Applications of Multiple Sequence Alignment

• Identify conserved domains/elements in sequences
• Compare regions of similarity among multiple organisms
• Identify probes for similar sequences in other organisms
• Develop PCR primers
• Phylogenetic analysis

Page 6: Multiple Sequence Alignments


Definition

A multiple alignment of strings S1, …, Sk is a series of strings with spaces S1’, …, Sk’ such that
• |S1’| = … = |Sk’|
• Sj’ is an extension of Sj by insertion of spaces

Goal: Find an optimal multiple alignment.

Page 7: Multiple Sequence Alignments


Scoring Alignments

In order to find an optimal alignment, we need to be able to measure how good an alignment is.
• Sum of pairs (SP) method: in a column, score each pair of letters and total the scores. Pairs of gaps score 0.
• Total up the scores for each column.

Page 8: Multiple Sequence Alignments


SP Method Example

Using the BLOSUM62 matrix, gap penalty -8

- I K
S I K
S S E

• In column 1, we have the pairs (-,S), (-,S), (S,S)
• Score of column 1: -8 - 8 + 4 = -12
• k(k-1)/2 pairs per column
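The column arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and only the handful of BLOSUM62 entries needed for this three-row alignment are typed in by hand rather than loaded from a library.

```python
from itertools import combinations

GAP_PENALTY = -8  # gap penalty used on the slide

# Only the BLOSUM62 entries this small example needs (keys are sorted pairs).
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("I", "I"): 4, ("I", "S"): -2,
    ("K", "K"): 5, ("E", "K"): 1, ("E", "E"): 5,
}

def pair_score(a, b):
    """Score one pair of symbols; a pair of gaps scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP_PENALTY
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]

def sp_column(column):
    """Sum-of-pairs score of one column: all k(k-1)/2 pairs."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sp_alignment(rows):
    """Total SP score: add up the column scores."""
    return sum(sp_column(col) for col in zip(*rows))

rows = ["-IK", "SIK", "SSE"]
column_scores = [sp_column(col) for col in zip(*rows)]
total = sp_alignment(rows)
print(column_scores, total)
```

Column 1 gives -12 as on the slide; the other two columns follow the same rule.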

Page 9: Multiple Sequence Alignments


Dynamic Programming

• The dynamic programming approach can be adapted to MSA
• For simplicity, assume k sequences of length n
• The dynamic programming array F is k-dimensional, of length n+1 in each dimension (including initial gaps)
• The entry F(i1, …, ik) represents the score of an optimal alignment of s1[1..i1], …, sk[1..ik]

Page 10: Multiple Sequence Alignments


Dynamic Programming

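Letting i represent the vector (i1,…,ik) and b represent a nonzero binary vector of length k, we fill in the array with the formula

\[ F(i) = \max_{b} \bigl[ F(i - b) + \mathrm{SP}(\mathrm{Column}(s, i, b)) \bigr] \]

where (selecting a column to score)

\[ \mathrm{Column}(s, i, b) = (c_j)_{1 \le j \le k}, \qquad
   c_j = \begin{cases} s_j[i_j] & \text{if } b_j = 1 \\ \text{-} & \text{if } b_j = 0 \end{cases} \]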
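The recurrence can be written almost literally as a memoized function. The sketch below is illustrative only: the names are hypothetical, the scoring (match +1, mismatch 0, gap -1) is an assumption chosen to keep the code short, and the running time is exponential in the number of sequences, so it is only usable on toy inputs.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = "-"

def pair_score(a, b, match=1, mismatch=0, gap=-1):
    # Assumed toy scoring; a pair of gaps scores 0 as in the SP method.
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def sp(column):
    """Sum-of-pairs score of one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def column(seqs, i, b):
    """c_j = s_j[i_j] if b_j = 1, else a gap (i_j is 1-based)."""
    return tuple(s[ij - 1] if bj else GAP for s, ij, bj in zip(seqs, i, b))

def msa_score(seqs):
    """Optimal SP score via F(i) = max_b [F(i - b) + SP(Column(s, i, b))]."""
    k = len(seqs)
    moves = [b for b in product((0, 1), repeat=k) if any(b)]  # nonzero b

    @lru_cache(maxsize=None)
    def F(i):
        if not any(i):
            return 0  # all prefixes empty
        best = None
        for b in moves:
            prev = tuple(ij - bj for ij, bj in zip(i, b))
            if min(prev) < 0:
                continue  # b would step outside the array
            cand = F(prev) + sp(column(seqs, i, b))
            if best is None or cand > best:
                best = cand
        return best

    return F(tuple(len(s) for s in seqs))

print(msa_score(["CAT", "CT"]))
```

With two sequences the recurrence reduces to ordinary Needleman-Wunsch, which makes a convenient sanity check.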

Page 11: Multiple Sequence Alignments


Example

s1: MPE
s2: MKE
s3: MSKE
s4: SKE

• Let i=(1,1,1,1), b=(1,0,0,0)
• Checking F(0,1,1,1), which is F(i-b)
• Column(s,i,b) is (M, -, -, -)
• SP-score is -24 (assuming gap penalty of -8)

Page 12: Multiple Sequence Alignments


Analysis

• O(n^k) entries to fill
• Each entry combines O(2^k) other entries
• Costs O(k^2) to calculate each SP score
• Overall cost is O(k^2 2^k n^k), or exponential in the number of sequences!
• MSA with SP-score has been shown to be NP-complete
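To get a feel for how quickly this bound grows, the product can simply be evaluated for a few values of k. This is a rough illustration: the function name is made up, and constant factors are ignored.

```python
def exact_msa_cost(n, k):
    """Evaluate k^2 * 2^k * n^k, the bound above without constant factors."""
    return k**2 * 2**k * n**k

# Sequences of length 100: each extra sequence multiplies the work
# by roughly 2 * 100.
for k in (2, 3, 4, 5):
    print(k, exact_msa_cost(100, k))
```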

Page 13: Multiple Sequence Alignments


Star Alignments

• Heuristic method for multiple sequence alignments
• Select a sequence sc as the center of the star
• For each sequence si among s1, …, sk with i ≠ c, perform a Needleman-Wunsch global alignment of si with sc
• Aggregate alignments with the principle “once a gap, always a gap.”

Page 14: Multiple Sequence Alignments


Star Alignments Example

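[Figure: star with s2 at the center, joined to s1, s3 and s4]

s1: MPE
s2: MKE
s3: MSKE
s4: SKE

Pairwise alignments with the center s2:

MPE
| |
MKE

MSKE
  ||
-MKE

SKE
 ||
MKE

Aggregating one alignment at a time:

MPE
MKE

-MPE
-MKE
MSKE

-MPE
-MKE
MSKE
-SKE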
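The aggregation step ("once a gap, always a gap") can be sketched as a merge that walks the center row of the growing alignment and the center row of each new pairwise alignment in parallel. The function name is hypothetical, and the pairwise alignments are taken directly from this example rather than computed.

```python
GAP = "-"

def merge_into_star(rows, center_aln, other_aln):
    """Add one pairwise alignment (center vs. other) to a star alignment.

    rows[0] is the center as it currently appears in the multiple
    alignment; center_aln / other_aln are the two gapped rows of the
    new pairwise alignment. Gaps already present are never removed.
    """
    master = rows[0]
    merged = [[] for _ in rows]
    new_row = []
    p = q = 0
    while p < len(master) or q < len(center_aln):
        master_gap = p < len(master) and master[p] == GAP
        center_gap = q < len(center_aln) and center_aln[q] == GAP
        if master_gap and not center_gap:
            # Gap column exists only in the multiple alignment:
            # keep it and pad the incoming sequence.
            for out, row in zip(merged, rows):
                out.append(row[p])
            new_row.append(GAP)
            p += 1
        elif center_gap and not master_gap:
            # Gap exists only in the new pairwise alignment:
            # open a new gap column in every existing row.
            for out in merged:
                out.append(GAP)
            new_row.append(other_aln[q])
            q += 1
        else:
            # Same center position (or a shared gap) in both.
            for out, row in zip(merged, rows):
                out.append(row[p])
            new_row.append(other_aln[q])
            p += 1
            q += 1
    return ["".join(out) for out in merged] + ["".join(new_row)]

# Center s2 = MKE; pairwise alignments as (center row, other row).
rows = ["MKE"]
for center_aln, other_aln in [("MKE", "MPE"), ("-MKE", "MSKE"), ("MKE", "SKE")]:
    rows = merge_into_star(rows, center_aln, other_aln)
print("\n".join(rows))
```

The rows come out with the center first (-MKE, -MPE, MSKE, -SKE), which is the final alignment of the example up to row order.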

Page 15: Multiple Sequence Alignments


Choosing a center

• Try them all and pick the one with the best score
• Calculate all O(k^2) pairwise alignments, and pick the sequence sc that maximizes

\[ \sum_{i \ne c} \mathrm{score}(s_c, s_i) \]
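A small sketch of this selection rule follows. The scoring parameters (match +1, mismatch 0, gap -1) and the function names are assumptions for illustration; any pairwise global alignment score could be substituted.

```python
def nw_score(a, b, match=1, mismatch=0, gap=-1):
    """Needleman-Wunsch global alignment score (score only, no traceback)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def choose_center(seqs):
    """Return (index of center, list of summed scores per candidate)."""
    k = len(seqs)
    totals = [
        sum(nw_score(seqs[c], seqs[i]) for i in range(k) if i != c)
        for c in range(k)
    ]
    best = max(range(k), key=totals.__getitem__)
    return best, totals

center, totals = choose_center(["MPE", "MKE", "MSKE", "SKE"])
print(center, totals)
```

Under this assumed scoring the four example sequences give index 1, that is s2 = MKE, as the center.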

Page 16: Multiple Sequence Alignments


Analysis

• Assuming all sequences have length n
• O(n^2) to calculate a global alignment
• O(k) global alignments to calculate
• Using a reasonable data structure for joining alignments, each join costs no worse than O(kl), where l is an upper bound on alignment lengths
• O(kn^2 + k^2 l) overall cost

Page 17: Multiple Sequence Alignments


Tree Alignments

• Model the k sequences with a tree having k leaves (1 to 1 correspondence)
• Compute a weight for each edge, which is the similarity score
• Sum of all the weights is the score of the tree
• Find tree with maximum score

Page 18: Multiple Sequence Alignments


Tree alignment example

• Match +1, gap -1, mismatch 0
• If x=CT and y=CG, score of 6

[Figure: tree with leaves CAT, GT, CTG, CG and interior nodes x, y]
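The score of 6 can be reproduced by summing global alignment scores over the tree's edges. The topology used here is an assumption read off the figure layout: x is joined to CAT and GT, y is joined to CTG and CG, and x is joined to y. The function name is hypothetical.

```python
def similarity(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment score with the scoring stated on the slide."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

x, y = "CT", "CG"

# Assumed edges of the tree (see the note above).
edges = [("CAT", x), ("GT", x), (x, y), ("CTG", y), ("CG", y)]

edge_weights = [similarity(a, b) for a, b in edges]
tree_score = sum(edge_weights)
print(edge_weights, tree_score)
```

Finding the best labels for x and y, rather than scoring given ones, is the hard part of the tree alignment problem.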

Page 19: Multiple Sequence Alignments


Analysis

• The tree alignment problem is NP-complete
  • Hence, phylogenetic tree generation is NP-complete
• Again, likely only exponential time solutions are available (for optimal answers)

Page 20: Multiple Sequence Alignments


Progressive Approaches

• CLUSTALW
  • Perform pairwise alignments
  • Construct a tree, joining most similar sequences first (guide tree)
  • Align sequences sequentially, using the phylogenetic tree
• PILEUP
  • Similar to CLUSTALW
  • Uses UPGMA to produce tree (chapter 6)
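As a rough illustration of how a guide tree that joins the most similar sequences first can be built, here is a minimal UPGMA sketch. It is not the CLUSTALW or PILEUP implementation; the function name and the distance values are made up, and in practice the distances would be derived from the pairwise alignments.

```python
def upgma(names, dist):
    """Build a tree (nested tuples) from a symmetric distance matrix.

    Repeatedly merges the two closest clusters; the distance from the
    merged cluster to any other is the size-weighted average.
    """
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (tree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair, stored with a < b
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del d[(a, b)]
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree

# Hypothetical distances between four sequences A, B, C, D.
names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(names, dist)
print(tree)
```

A progressive aligner would then align sequences in the order the tree joins them: A with B first, then C, then D.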

Page 21: Multiple Sequence Alignments


Progressive Approaches

Page 22: Multiple Sequence Alignments


Problems with Progressive Alignments

• MSA depends on pairwise alignments
• If sequences are very distantly related, much higher likelihood of errors
• Care must be taken in choosing scoring matrices and penalties
• Other approaches use Bayesian methods such as hidden Markov models

Page 23: Multiple Sequence Alignments


When Craig Talks Next

• Introduction to Bayesian Statistics
• Profile and Block analysis
• Expectation Maximization (MEME)
• Introduction to HMMs
• Multiple sequence alignments using HMMs