multiple sequence alignment (msa)

38
1 Multiple Sequence Alignment (MSA) (MSA)

Upload: tarak

Post on 25-Feb-2016

71 views

Category:

Documents


2 download

DESCRIPTION

Multiple Sequence Alignment (MSA). Multiple Alignment. Number of sequences >2 Global alignment Seek an alignment that maximizes score. Multiple Alignment. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Multiple Sequence Alignment (MSA)

1

Multiple Sequence Alignment(MSA)(MSA)

Page 2: Multiple Sequence Alignment (MSA)

2

Multiple Alignment

• Number of sequences >2• Global alignment• Seek an alignment that maximizes score

Page 3: Multiple Sequence Alignment (MSA)

3

Multiple AlignmentMultiple sequence alignment can be viewed as an extension of pairwise sequence alignment, but the complexity of the computation grows exponentially with the number of sequences being considered and their lengths and, therefore, it is not feasible to search exhaustively for the optimal alignment even for even a modest number of short sequences.

For example, there are approximately 1038 possible alignments that can be produced from five 10-nucleotide-long sequences

Page 4: Multiple Sequence Alignment (MSA)

4

Possible Alignments

TCG-GC-TCGACGAGCGT-GAT--G-GCG-AG---CCG-TCGTA--AC

TCGGCTCGACGAGCGTGAT-GGCGAGC---CGTCGTAAC- … 1038

Page 5: Multiple Sequence Alignment (MSA)

5

Multiple Alignment

• In addition to the scores employed by pairwise alignment, we have additional scores.

• However, we still seek an alignment that maximizes score

Page 6: Multiple Sequence Alignment (MSA)

6

Similarity Scoring Scheme

{4,0,0,0,0} = 1{4,0,0,0,0} = 1{3,1,0,0,0} = 0.75{3,1,0,0,0} = 0.75 {2,2,0,0,0} = 0.5{2,2,0,0,0} = 0.5{2,1,1,0,0} = 0.5{2,1,1,0,0} = 0.5 {1,1,1,1,0} = 0{1,1,1,1,0} = 0

5 character states: A, C, T, G, –5 character states: A, C, T, G, –

Page 7: Multiple Sequence Alignment (MSA)

7

Two Possible Alignments

GTCGTAGTCG-GC-TCGACGTC-TAG-CGAGCGT-GATGC-GAAG-AG-GCG-AG-CGCCGTCG-CG-TCGTA-AC

GTCGTAGTCGGCTCGACGTCTAGCGAGCGTGAT-GCGAAGAGGCGAGC---GCCGTCGCGTCGTAAC-

1*1 + 3*0.75 + 12*0.5 = 9.25

4*1 + 13*0.75 + 2*0.5 = 11.75

Page 8: Multiple Sequence Alignment (MSA)

8

Alignments can be easy or difficult

Easy

Difficult

GCGGCCCA TCAGGTAGTT GGTGG

GCGGCCCA TCAGGTAGTT GGTGG

GCGTTCCA TCAGCTGGTT GGTGG

GCGTCCCA TCAGCTAGTT GGTGG

GCGGCGCA TTAGCTAGTT GGTGA

***...** *.**.*.*** ****. TTGACATG CCGGGG---A AACCG

T-GACATG CCGGTG--GT AAGCC

TTGGCATG -CTAGG---A ACGCG

TTGACATG -CTAGGGAAC ACGCG

TTGACATC -CTCTG---A ACGCG

* *.***. *...* . *..*.

Page 9: Multiple Sequence Alignment (MSA)

9

Multiple AlignmentMultiple Alignment• Dynamic programming (exhaustive, exact)

– Consider 2 protein sequences of 100 amino acids in length.– If it takes 1002 (103) seconds to exhaustively align these

sequences, then it will take 104 seconds to align 3 sequences, 105 to align 4 sequences, etc.

– It will take ~1021 seconds to align 20 sequences. One year is ~3 ✕ 107 seconds. The age of the visible universe is ~1018 seconds.

• Progressive alignment (heuristic, approximate)

Page 10: Multiple Sequence Alignment (MSA)

10

Progressive Alignment• Devised by Feng and Doolittle in 1987.• Essentially a heuristic method and, as such, not

guaranteed to find the “optimal” or “best” alignment.• Requires pairwise alignments as a starting

point• One of the first successful implementation was Clustal

(by Des Higgins et al.)

n(n 1)2

Page 11: Multiple Sequence Alignment (MSA)
Page 12: Multiple Sequence Alignment (MSA)

12

• Sum of pairwise alignment scores– For n sequences, there are n(n-1)/2 pairs

GTCGTAGTCG-GC-TCGACGTC-TAG-CGAGCGT-GATGC-GAAG-AG-GCG-AG-CGCCGTCG-CG-TCGTA-AC

• Based on pairwise alignment scores– Build n by n table of pairwise scores

• Align similar sequences first– After alignment, consider as single sequence– Continue aligning with further sequences

Page 13: Multiple Sequence Alignment (MSA)

13

ABCD

A B C D

A

B 11

C 3 1

D 2 2 10

Compute the pairwise Compute the pairwise alignments for all against all alignments for all against all

(6 pairwise alignments)(6 pairwise alignments)the similarities are stored in a the similarities are stored in a

tabletable

First step:

Page 14: Multiple Sequence Alignment (MSA)

14

A B C D

AB 11C 3 1D 2 2 10

A

DC

B

Cluster the sequences to create a Cluster the sequences to create a tree (tree (guide treeguide tree):):

• Represents the order in which pairs of Represents the order in which pairs of sequences are to be alignedsequences are to be aligned• Similar sequences are neighbors in the Similar sequences are neighbors in the tree tree • Distant sequences are distant from Distant sequences are distant from each other in the treeeach other in the tree

Second step:

Guide tree

Page 15: Multiple Sequence Alignment (MSA)

15

Guide Tree

A guide tree is not a phylogenetic treeA guide tree is not a phylogenetic tree!!

Page 16: Multiple Sequence Alignment (MSA)

16

A

DC

B

Align most similar pairsAlign most similar pairs

Align the alignments as if Align the alignments as if each of them was a single each of them was a single sequence (replace with a sequence (replace with a single consensus sequence single consensus sequence or use a profile)or use a profile)

Third step:

Page 17: Multiple Sequence Alignment (MSA)

17

Alignment of alignments

M Q T FL H T WL Q S W

L T I FM T I W

M Q T - FL H T - WL Q S - WL - T I FM - T I W

X

Y

Page 18: Multiple Sequence Alignment (MSA)
Page 19: Multiple Sequence Alignment (MSA)

19

> Usually FASTA format

Input File

Page 20: Multiple Sequence Alignment (MSA)

20

> Sulfolobus acidocaldarius gi|152927|gp|J03218|SSOATPMA_1 MVSEGRVVRVNGPLVIADGMREAQMFEVVYVSDLKLVGEITRIE

> Thermococcus sp. gi|2605627 ATPase alpha subunit MGRIIRVTGPLVVADGMKGAKMYEVVRVGEMGLIGEIIRLEGDKAVIQVYEETAGIRPGEPVEGTGSSLS

> Acetabularia acetabulum gi|1303673|gnl|PID|d1009732 adenosine triphosphatase A subunit MSKAKEGDYGSIKKVSGPVVVADNMGGSAMYELVRVGTGELIGEIIRLEGDTATIQVYEETSGLTVGDGV …

Input file

Output file

Page 21: Multiple Sequence Alignment (MSA)

21

Sequences that are similar only in some smaller regions may be misaligned because MSA tries to find global, rather than local alignments.

Sequence that contains a large insertion compared to the rest may be misaligned because MSA tries to find global, rather than local alignments.

MSA: Problems

Page 22: Multiple Sequence Alignment (MSA)

22

Sequence that contains a repetitive element (such as a domain), while another sequence only contains one copy.

VS

MSA: Problems

Page 23: Multiple Sequence Alignment (MSA)

23

• Pairwise alignment is an optimal algorithm.

• Multiple alignment is not an optimal algorithm. Better alignments might exist!

• The algorithm yields a possible alignment, but not necessarily the best one.

MSA: Problems

Page 24: Multiple Sequence Alignment (MSA)

24

MSA: Problems

• Because of the progressive methodology, errors at the beginning of the alignment are more important than errors at the end of the alignment.

Page 25: Multiple Sequence Alignment (MSA)

25

Clustal: Advice• Sequence weighting (Should we treat all the sequences as equals?) • Position weighting (Should we treat all positions in the sequences

as equals?)• Varying substitution matrices• Residue-specific gap penalties and reduced penalties in hydrophilic

regions (external regions of protein sequences), encourage gaps in loops rather than in core regions.

• Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments

• Vary “gap opening penalty” and “gap extension penalty”• Discourage too many gaps, too close to one another• Take into account hydrophilic and hydrophobic structures• Avoid divergent sequences, as they are the most difficult to align

Page 26: Multiple Sequence Alignment (MSA)

26

Alignment of protein-coding DNA sequences

• It is not very sensible to align the DNA sequences of protein-coding genes.

ATGCTGTTAGGGATGCTCGTAGGG

ATGCT-GTTAGGGATGCTCGTA-GGG

The result might be implausible and might not reflect what is known about biological processes.It is much more sensible to translate the sequences into their corresponding amino acid sequences, align these protein sequences and then put the gaps in the DNA sequences according to the amino acid alignment.

Page 27: Multiple Sequence Alignment (MSA)

27

Effect of gap penalties on amino-acid alignment Human pancreatic hormone precursor versus chicken pancreatic hormone

(a) Penalty for gaps is 0(b) Penalty for a gap of size k nucleotides is wk = 1 + 0.1k(c) The same alignment as in (b), only the similarity between the two sequences is further enhanced by showing pairs of biochemically similar amino acids

Page 28: Multiple Sequence Alignment (MSA)

28

The anchored-alignment procedure uses expert knowledge to improve multiple alignments. The user can specify a list of anchor pointsanchor points, each of which consists of a pair of equal-length segments that are to be aligned by the program.

If residue x from one of the input sequences is paired to residue y from another sequence, then y is the only residue that can be aligned to x, and vice versa.

All residues to the left and the right of x are aligned, respectively, to the residues to the left and the right of y.

Anchored alignment:

Page 29: Multiple Sequence Alignment (MSA)

29

Consider, for example, the following example:

>seq1 WKKNADAPKRAMTSFMKAAY >seq2 WNLDTNSPEEKQAYIQLAKDDRIRYD >seq3 WRMDSNQKNPDSNNPKAAYNKGDANAPK

The non-anchored default version of DIALIGN would calculate the following alignment for this input sequence set:

seq1 1 WKKNAD-----APKRamtsfmKAAY------------ seq2 1 WNLDTN-----SPEE------KQAYiqlaKDDriryd seq3 1 WRMDSNqknpdSNNP------KAAYn---KGDsnapk

Page 30: Multiple Sequence Alignment (MSA)

30

Now let's assume, the user has some expert knowledge about a certain domain that is present in all of the input sequences; the domains in the three sequences are thought to be homologous to each other:

>seq1 WKKNADAPKRAMTSFMKAAY >seq2 WNLDTNSPEEKQAYIQLAKDDRIRYD >seq3 WRMDSNQKNPDSNNPKAAYNKGDANAPK

The user wants to define this motif as anchor and align the rest of the sequences automatically, given the pre-defined constraints imposed by this anchor.

Since anchor points are defined as pairs of equal-length segments, we need two anchor points to enforce alignment of the above motif.

Page 31: Multiple Sequence Alignment (MSA)

31

For example, one could choose

Anchor point 1: >seq1 WKKNADAPKRAMTSFMKAAY >seq2 WNLDTNSPEEKQAYIQLAKDDRIRYD >seq3 WRMDSNQKNPDSNNPKAAYNKGDANAPK

Anchor point 2: >seq1 WKKNADAPKRAMTSFMKAAY >seq2 WNLDTNSPEEKQAYIQLAKDDRIRYD >seq3 WRMDSNQKNPDSNNPKAAYNKGDANAPK

If the above motif is to be aligned by our program, these two anchor points need to be specified.

Page 32: Multiple Sequence Alignment (MSA)

32

Format for user-defined anchor points: To specify a set of anchor points, a file with the coordinates of these anchor points is needed.

Since each anchor point corresponds to a equal-length segment pair involving two of the input sequences, coordinates for anchor points are defined as follows:(1) first sequence involved (2) second sequence involved (3) start of anchor in first sequence (4) start of anchor in second sequence (5) length of anchor. (6) specify a score of an anchor point.

This score is necessary to prioritize anchor point in case they are inconsistent with each other, i.e., if not all of them can be used simultaneously for the same alignment. the above coordinates (in the above given order).

Page 33: Multiple Sequence Alignment (MSA)

33

Thus, the above two anchor points are specified as follows: 1 2 4 6 6 4.5

2 3 6 11 8 1.3

1

2

4

6

6

4.5 and 1.3 are (arbitrary) scores for anchor points 1 and 2.

Page 34: Multiple Sequence Alignment (MSA)

34

MSA Approaches• Progressive approach:

CLUSTALW (CLUSTALX)T-COFFEEPILEUP

• Iterative approach (Repeatedly realigned subsets of sequences):

MultAlin, DiAlign

• Statistical Methods (e.g., Hidden Markov Models)SAM2K

• Genetic algorithmsSAGA

Page 35: Multiple Sequence Alignment (MSA)

35

T-coffee

• An MSA program • Uses principles similar to Clustal• Combining sequence alignment method

with structural alignment techniques • More accurate but longer running time• Limits to the number of sequences it can

align (~100)

Page 36: Multiple Sequence Alignment (MSA)

36

MSA: ProblemsGiddy Landan and Dan Graur. 2007. Heads or Tails: A Simple Reliability Check for Multiple Sequence Alignments. Molecular Biology and Evolution 24(6):1380-1383.

Page 37: Multiple Sequence Alignment (MSA)
Page 38: Multiple Sequence Alignment (MSA)

Genomic alignment (with MAUVE)