
Multiple sequence alignments (Alineamientos múltiples de secuencias). Latin American Workshop on Molecular Evolution (Taller Latinoamericano de Evolución Molecular), CCG-UNAM, México. 17-28 January 2011

Enrique Merino, IBT-UNAM

Multiple sequence alignments

Special thanks to all the scientists who made their presentations publicly available on the web; many slides were taken from them to prepare this presentation.

Web sites used in our practice (figures are linked to their corresponding web sites):

Sequence Retrieval System

RSA Tools

BLAST

ClustalW

What is a Multiple Sequence Alignment?

Structural criteria: residues are arranged so that those playing a similar role end up in the same column.

Evolutionary criteria: residues are arranged so that those having the same ancestor end up in the same column.

Similarity criteria: as many similar residues as possible end up in the same column.


[Figure: multidimensional alignment matrix with axes x, y and z; example sequences AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA and AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC]

Multiple sequence alignments

Seems a simple extension: align k sequences at the same time.

Unfortunately, this can get very expensive. For more than eight proteins of average length, the exact problem cannot be solved in practice with current computer power. Therefore, all of the methods capable of handling larger problems in practical timescales make use of “heuristics”.

Aligning N sequences of length L requires a matrix of size $L^N$, where each cell in the matrix has $2^N - 1$ neighbors.

This gives a total time complexity of $O(2^N L^N)$.
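To get a feeling for these numbers, the short sketch below evaluates the cell count, the neighbor count and their product for a few values of N. The function name `msa_dp_cost` and the sequence length of 300 are illustrative assumptions, not values taken from the slides.

```python
def msa_dp_cost(n_sequences, length):
    """Return (cells, neighbours_per_cell, total_steps) for exact N-dimensional DP.

    cells      = L**N      (size of the dynamic programming matrix)
    neighbours = 2**N - 1  (predecessor cells examined for each cell)
    """
    cells = length ** n_sequences
    neighbours = 2 ** n_sequences - 1
    return cells, neighbours, cells * neighbours


if __name__ == "__main__":
    # Hypothetical protein length of 300 residues, chosen only for illustration.
    for n in (2, 3, 8):
        cells, neighbours, steps = msa_dp_cost(n, 300)
        print(f"N={n}: {cells:.3e} cells, {neighbours} neighbours, {steps:.3e} steps")
```

Going from two to eight sequences multiplies the work by many orders of magnitude, which is why practical programs fall back on heuristics.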

Multiple sequence alignments

The MSA contains what you put inside…

You can view your MSA as:

A record of evolution

A summary of a protein family

A collection of experiments made for you by Nature…

An MSA is a MODEL

What is a Multiple Sequence Alignment?

[Figure: time scale of 2,000,000,000 years]


What Is A Multiple Sequence Alignment?

It Indicates the RELATIONSHIP between residues

of different sequences.

It REVEALS

-Similarities

-Inconsistencies

Multiple Alignments are

CENTRAL to MOST

Bioinformatics Techniques.

Why Is It Difficult To Compute A Multiple Sequence Alignment?

A CROSSROAD PROBLEM

BIOLOGY: What is A Good Alignment?

COMPUTATION: What is THE Good Alignment?

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD

wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE

trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP

mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

Why Is It Difficult To Compute A Multiple Sequence Alignment?

[Figure: BIOLOGY and COMPUTATION form a CIRCULAR PROBLEM, in which Good Sequences and Good Alignment each depend on the other]

Multiple Alignments: What Are They Good For?


Multiple Sequence Alignment Derived Information

Example alignment block:

RWDAGCVN

RWDSGCVN

RWHHGCVQ

RWKGACYN

RWLWACEQ

Information that can be derived from it:

Regular expression (pattern): R-W-x(2)-[AG]-C-x-[NQ]

Profile (gapped weight matrix)

Position-specific weight matrices (blocks)

Hidden Markov model (HMM)

Frequency (identity) matrices

Fingerprint

Motif
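The pattern R-W-x(2)-[AG]-C-x-[NQ] can be checked against the five aligned sequences with an ordinary regular expression. The sketch below is a minimal converter that assumes only the pattern elements used on the slide (single residues, `x`, bracketed sets and a fixed repeat such as `x(2)`); the helper name `prosite_to_regex` is hypothetical, and the full pattern syntax used by motif databases has more features than are handled here.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression.

    Handles only: single residues (R), wildcards (x), sets ([AG]) and
    fixed repeats (x(2)). Anything else is outside this sketch.
    """
    parts = []
    for element in pattern.split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":          # 'x' means any residue
            core = "."
        parts.append(core + (f"{{{repeat}}}" if repeat else ""))
    return "".join(parts)


block = ["RWDAGCVN", "RWDSGCVN", "RWHHGCVQ", "RWKGACYN", "RWLWACEQ"]
regex = re.compile(prosite_to_regex("R-W-x(2)-[AG]-C-x-[NQ]"))

if __name__ == "__main__":
    for seq in block:
        print(seq, bool(regex.fullmatch(seq)))
```

Every sequence of the block matches, while a sequence that breaks a conserved position (for example a C in place of the [AG] position) does not.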

Multiple Sequence Alignment classification

Simultaneous, as opposed to progressive [simultaneous methods use all the information at the same time]

Exact, as opposed to heuristic [heuristics cut corners, like BLAST versus Smith-Waterman (SW), and do not guarantee an optimal solution]

Stochastic, as opposed to deterministic [stochastic methods contain an element of randomness; an example is Monte Carlo surface estimation]

Iterative, as opposed to non-iterative [iterative methods run the same algorithm many times; most stochastic methods are iterative]

“The Correct Alignment”

                      Correct according to     Correct according
                      optimality criteria      to homology
Exhaustive methods    Always                   Not always
Heuristic methods     Not always               Not always

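[Figure: classification of MSA programs into Simultaneous, Progressive and Iterative methods, with non-tree-based approaches (GAs, HMMs) marked. Programs shown: MSA, DCA, OMA, Combalign, POA, Praline, MAFFT, Clustal, Dialign, T-Coffee, Iteralign, Prrp, SAM, HMMer, SAGA, GA]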


[Figure: the same classification of MSA programs (Simultaneous, Progressive, Iterative), with an additional Stochastic label next to SAGA]

In any case, MSA methods treat the evolution of each column as an independent process.

How close to reality is this assumption?

3D protein models can be evaluated based on the co-evolution of their interacting residues.

[Figure: co-evolving residues in two interacting proteins, A and B]

Correlated substitutions between pairs of positions, either within one multiple sequence alignment or between the alignments of two protein families, can be used to predict intra-protein and protein-protein interactions.
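One common way to quantify how strongly two alignment columns are correlated is their mutual information. The slides do not say which measure is meant, so the sketch below is only an illustration under that assumption; the four-sequence toy alignment is invented for the example.

```python
from collections import Counter
from math import log2


def column(alignment, index):
    """Return the residues found at one column of an alignment (list of strings)."""
    return [seq[index] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a, freq_b = Counter(col_a), Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi


# Hypothetical toy alignment: columns 0 and 1 always change together,
# column 2 is fully conserved.
toy = ["AKL", "AKL", "DEL", "DEL"]

if __name__ == "__main__":
    print(mutual_information(column(toy, 0), column(toy, 1)))  # co-varying pair
    print(mutual_information(column(toy, 0), column(toy, 2)))  # conserved column
```

A fully conserved column carries no covariation signal, so only pairs of variable columns can score highly. Real analyses also have to correct for phylogenetic relatedness among the sequences, which this sketch ignores.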


Julie D. Thompson, Desmond G. Higgins and Toby J. Gibson. Nucleic Acids Research, 1994, Vol. 22, No. 22, 4673-4680

SeqA  Name   Len(aa)  SeqB  Name   Len(aa)  Identity
1     human  60       2     dog    60       77%
1     human  60       3     mouse  59       61%
2     dog    60       3     mouse  59       52%

human EYSGSSEKIDLLASDPHEALICKSERVHSKSVESNIEDKIFGKTYRKKASLPNLSHVTEN 480

Dog EYSGSSEKIDLMASDPQDAFICESERVHTKPVGGNIEDKIFGKTYRRKASLPKVSHTTEV 477

mouse GGFSSSRKTDLVTPDPHHTLMCKSGRDFSKPVEDNISDKIFGKSYQRKGSRPHLNHVTE 476

Multiple sequence alignments. Clustal W

Step 1 – Pairwise alignment. Compare each sequence with every other sequence and calculate a distance matrix.

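Score = number of exact matches divided by the sequence length (ignoring gaps). Strictly speaking this is a similarity (fractional identity) rather than a distance: the higher the number, the more closely related the two sequences are. A distance can be obtained as 1 minus this value.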

In this matrix, the sequence of Human is 76% identical to the sequence of Dog:

     H      D      M
H    -
D    0.76   -
M    0.61   0.52   -

Multiple sequence alignments. Clustal W
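The identity calculation of Step 1 can be sketched as follows. The code assumes the pairs have already been aligned (the pairwise alignment itself is not shown), uses three short toy sequences rather than the human, dog and mouse sequences, and converts identity to distance as 1 minus identity; the function names are illustrative.

```python
from itertools import combinations


def fractional_identity(a, b):
    """Exact matches divided by the number of columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def identity_matrix(aligned):
    """Map every pair of sequence names to the fractional identity of that pair."""
    return {(n1, n2): fractional_identity(aligned[n1], aligned[n2])
            for n1, n2 in combinations(aligned, 2)}


# Hypothetical, already aligned toy sequences.
toy = {"s1": "ACGT", "s2": "ACGA", "s3": "AGGA"}
identities = identity_matrix(toy)
distances = {pair: 1.0 - value for pair, value in identities.items()}

if __name__ == "__main__":
    for pair, value in identities.items():
        print(pair, f"identity={value:.2f}", f"distance={distances[pair]:.2f}")
```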


Step 2 – Create Guide Tree. Use the results of the distance

matrix to create a Guide Tree to help determine in what order the

sequences will be aligned.

The guide tree, or dendrogram, has no phylogenetic meaning and cannot be used to show evolutionary relationships.

[Figure: the distance matrix (H-D 0.76, H-M 0.61, D-M 0.52) and the guide tree derived from it, with leaves H, D and M]

Multiple sequence alignments. Clustal W

Initially the guide Trees were calculated using the UPGMA method.

The current version uses the Neighbour-Joining method which gives better

estimates of individual branch lengths.

Step 3 – Progressive Alignment

Follow the guide tree and align the sequences. Align the most closely related sequences first, then add the more distantly related ones and align them to the existing alignment, inserting gaps if necessary.

1. Align Human and Dog first

2. Add the Mouse sequence to the previous alignment of Human and Dog

Multiple sequence alignments. Clustal W

By the time the most distantly related sequences are

aligned, one already has a sample of aligned sequences

which gives important information about the variability at

each position

Multiple sequence alignments. Clustal W

Multiple sequence alignments. Clustal W: gap treatment

Short stretches of 5 hydrophilic residues often indicate loop or random coil regions (not essential for structure), and therefore gap penalties are reduced for such stretches.

Gap penalties for closely related sequences are lowered compared to more distantly related sequences (“once a gap, always a gap” rule). It is thought that those gaps occur in regions that do not disrupt the structure or function.

Alignments of proteins of known structure show that gaps do not occur more frequently than every eight residues. Therefore, the penalty for a gap is increased when it would fall within 8 residues of another gap. This gives a lower alignment score in that region.

A gap weight is assigned after each amino acid according to the frequency with which a gap occurs after that amino acid in nature.


As we know, there are many scoring matrices that

one can use depending on the relatedness of the

aligned proteins.

As the alignment proceeds to longer branches the

aa scoring matrices are changed to accommodate

more divergent sequences. The length of the

branch is used to determine which matrix to use.

Similar sequences with "hard" matrices (BLOSUM80)

Distant sequences with "soft" matrices (BLOSUM50)

Amino acid weight matrices

Multiple sequence alignments. Clustal W

Relative contribution of each pairwise alignment to the

global alignment score

Sequences are weighted to compensate for bias of redundant elements in the alignment

Multiple sequence alignments. Clustal W

Flowchart of computation steps in Clustal (http://www.clustal.org/):

1. Pairwise alignment: calculation of the distance matrix

2. Creation of an unrooted Neighbor-Joining tree

3. Rooted NJ tree (guide tree) and calculation of sequence weights

4. Progressive alignment following the guide tree


ClustalW: http://www.ch.embnet.org/software/ClustalW.html

Multiple sequence alignments. TCoffee

Regular progressive alignment strategy may produce

alignment errors

Multiple sequence alignments. TCoffee

T-Coffee: Mixing Local and Global Alignments

[Figure: local alignments and global alignments are combined, through library extension, into a library-based multiple sequence alignment]

The global alignments are constructed using ClustalW on the sequences, two at a time.

The local alignments are the ten top-scoring non-intersecting local alignments between each pair of sequences, gathered using the Lalign program of the FASTA package (a variant of the Smith and Waterman method).


T-Coffee: Primary Library

In the library, each alignment is represented as a list of pair-wise residue matches; each of these pairs is a constraint.

Not all of these constraints are equally important. This is taken into account when computing the multiple alignment, giving priority to the most reliable residue pairs.

T-Coffee: Analysis of Consistency

For each pair of aligned residues in the library, we can assign a weight that reflects the degree to which those residues align consistently with residues from the other sequences.

We enormously increase the value of the information in the library by examining the consistency of each pair of residues with residue pairs from all of the other alignments.

The Triplet Assumption

[Figure: the triplet assumption for residues X, Y and Z of SEQ A and SEQ B. Labels: Consistency, Consensus, ClustalW, T-Coffee]
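The consistency analysis can be sketched as a "library extension" over triplets of sequences. The sketch assumes the usual formulation in which the extended weight of a residue pair is its primary weight plus, for every residue of a third sequence aligned to both, the smaller of the two connecting weights. The data layout (residues as `(sequence, position)` tuples) and the weights are invented for illustration and are not T-Coffee's internal format.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(primary):
    """Extend a primary library of weighted residue pairs using triplets.

    primary maps (residue_x, residue_y) -> weight, where a residue is a
    (sequence_name, position) tuple and the two residues come from
    different sequences.
    """
    neighbours = defaultdict(dict)
    for (x, y), weight in primary.items():
        neighbours[x][y] = weight
        neighbours[y][x] = weight

    extended = defaultdict(float)
    for (x, y), weight in primary.items():
        extended[frozenset((x, y))] += weight          # direct evidence

    # Each residue z supports every pair (x, y) of its partners that lie in
    # two different sequences, with weight min(W(x, z), W(z, y)).
    for z, partners in neighbours.items():
        for x, y in combinations(sorted(partners), 2):
            if x[0] != y[0]:
                extended[frozenset((x, y))] += min(partners[x], partners[y])
    return dict(extended)


# Hypothetical primary library for three sequences A, B and C.
primary = {
    (("A", 3), ("B", 3)): 88,
    (("A", 3), ("C", 3)): 77,
    (("C", 3), ("B", 3)): 100,
    (("A", 5), ("C", 5)): 50,
    (("C", 5), ("B", 6)): 60,
}
extended = extend_library(primary)

if __name__ == "__main__":
    for pair, weight in sorted(extended.items(), key=lambda item: -item[1]):
        print(sorted(pair), weight)
```

The pair A3/B3 gains support from C3, and the pair A5/B6, which has no entry in the primary library at all, receives a weight purely through its shared partner C5.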


T-Coffee: Aligned sequences using the Extended Library

Dynamic Programming Using An Extended Library

Progressive Alignment

T-Coffee and Consistency…

Each library line is a soft constraint (a wish).

You can’t satisfy them all.

You must satisfy as many as possible (the easy ones).

Mixing Heterogeneous Data With T-Coffee

[Figure: local alignments, global alignments, multiple alignments, structural data and specialist methods feed into one multiple sequence alignment]


T-Coffee and Consistency… (Summary)

• The method is broadly based on the popular progressive approach to multiple alignment but avoids the most serious pitfalls caused by the greedy nature of this algorithm.

• With T-Coffee we pre-process a data set of all pair-wise alignments between the sequences.

• This provides us with a library of alignment information that can be used to guide the progressive alignment.

• Intermediate alignments are then based not only on the sequences to be aligned next but also on how all of the sequences align with each other.

• This alignment information can be derived from heterogeneous sources such as a mixture of alignment programs and/or structure superposition.

Why Do We Want To Mix Sequences and Structures?

3D-Coffee

• Sequences are cheap and common.

• Structures are expensive and rare.

Cheapest structure determination: sequence-structure alignment (THREAD or ALIGN)

ADKPRRP---LS-YMLWLN

ADKPKRPKPRLSAYMLWLN

Why Do We Want To Mix Sequences and Structures?

3D-Coffee

Convincing alignment

Same fold

Distant sequences are hard to align


chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD

wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE

trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP

mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

Multiple sequence alignments help in exploring the twilight zone.

Why Do We Want To Mix Sequences and Structures?

3D-Coffee

[Figure: structure superposition]

Conclusion: structures help, BUT NOT SO MUCH.

3D-Coffee: http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html


http://www.tcoffee.org/

MUSCLE: Multiple Sequence Alignment with reduced time and space complexity

Basic strategy: a progressive alignment is built, to which horizontal refinement is applied.

Three stages:

1. Draft progressive

2. Improved progressive

3. Refinement

At the end of each stage, a multiple alignment is available and the algorithm can be terminated.


MUSCLE. Stage 1. Draft Progressive

The goal of the first stage is to produce a multiple alignment, emphasizing speed over accuracy.

1.1. Similarity measure and distance estimate

Similarity is calculated using k-mer counting. A k-mer is a contiguous subsequence of length k, also known as a word or k-tuple. Related sequences tend to have more k-mers in common than expected by chance.

Example sequence: ACCATGCGAATGGTCCACAATG

k-mer   score
ATG     3
CCA     2
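The k-mer counts for the example sequence can be reproduced with a few lines of Python. The second function is an assumption about how counts from two sequences might be turned into a similarity: it takes the fraction of k-mers shared (counted with multiplicity) relative to the shorter sequence. The slides do not give the exact formula used, and the function names are illustrative.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer (word of length k) in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def shared_kmer_fraction(a, b, k):
    """Fraction of k-mers shared by two sequences, relative to the shorter one.

    Assumed similarity measure for illustration only.
    """
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return shared / (min(len(a), len(b)) - k + 1)


example = "ACCATGCGAATGGTCCACAATG"
counts = kmer_counts(example, 3)

if __name__ == "__main__":
    print("ATG:", counts["ATG"])
    print("CCA:", counts["CCA"])
    print("self similarity:", shared_kmer_fraction(example, example, 3))
```

No alignment is needed for this measure, which is what makes the draft stage fast.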

MUSCLE. Stage 1. Draft Progressive

1.1. Similarity measure and distance estimate

Based on the pairwise similarities, a triangular distance matrix is computed.

      X1    X2    X3    X4
X1    X
X2    0.6   X
X3    0.8   0.2   X
X4    0.3   0.7   0.5   X

MUSCLE. Stage 1. Draft Progressive

1.2. Tree construction using UPGMA


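From the distance matrix we construct a tree using the UPGMA method.

[Figure: UPGMA clustering of the distance matrix (X1-X2 0.6, X1-X3 0.8, X2-X3 0.2, X1-X4 0.3, X2-X4 0.7, X3-X4 0.5): X2 and X3 are joined into one cluster, X1 and X4 into another, and the two clusters are then joined at the root]

MUSCLE. Stage 1. Draft Progressive

1.2. Tree construction using UPGMA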

UPGMA (Unweighted Pair Group Method with Arithmetic mean)

One of the fastest tree construction methods.

It is a simple agglomerative, or bottom-up, data clustering method.

UPGMA assumes a constant rate of evolution (molecular clock hypothesis).

At each step, the nearest 2 clusters are combined into a higher-level cluster. The distance between any 2 clusters A and B is taken to be the average of all distances between pairs of objects "a" in A and "b" in B.
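The clustering rule just described can be written as a short, deliberately naive sketch. It recomputes every average from the original leaf distances instead of updating a matrix, omits branch lengths, and uses the four-sequence distances from the slide (X1-X2 0.6, X1-X3 0.8, X2-X3 0.2, X1-X4 0.3, X2-X4 0.7, X3-X4 0.5). The function name and the nested-tuple tree format are illustrative choices.

```python
from itertools import combinations


def upgma(names, dist):
    """Naive UPGMA clustering.

    dist maps frozenset({a, b}) -> distance between leaves a and b.
    Returns (tree, merges): the tree as nested tuples and the list of
    (cluster_1, cluster_2, average_distance) in merge order.
    """
    def average(c1, c2):
        # Mean of all leaf-to-leaf distances between the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]        # leaves contained in each cluster
    trees = {(name,): name for name in names}     # nested-tuple topology per cluster
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((c1, c2, average(c1, c2)))
        merged = c1 + c2
        trees[merged] = (trees[c1], trees[c2])
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
    return trees[clusters[0]], merges


names = ["X1", "X2", "X3", "X4"]
dist = {
    frozenset(("X1", "X2")): 0.6,
    frozenset(("X1", "X3")): 0.8,
    frozenset(("X2", "X3")): 0.2,
    frozenset(("X1", "X4")): 0.3,
    frozenset(("X2", "X4")): 0.7,
    frozenset(("X3", "X4")): 0.5,
}
tree, merges = upgma(names, dist)

if __name__ == "__main__":
    print(tree)
    for c1, c2, d in merges:
        print(c1, "+", c2, f"at average distance {d:.2f}")
```

With these distances X2 and X3 are joined first, then X1 and X4, and the two clusters meet at the root.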

MUSCLE. Stage 1. Draft Progressive

1.3. Progressive alignment

A progressive alignment is built by following the branching order of the tree, yielding a multiple alignment of all input sequences at the root.

The alignment is done by profiles (profile-profile alignment).


MUSCLE. Stage 2. Improved Progressive

2.1. Similarity measure and Distance estimate

The main source of error in the draft progressive stage is the

approximate kmer distance measure, which results in a

suboptimal tree.

MUSCLE therefore re-estimates the tree using the Kimura

distance, which is more accurate but requires an alignment
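A minimal sketch of such a correction is given below. It assumes the common Kimura approximation for protein distances, d = -ln(1 - D - D²/5), where D is the fraction of differing positions (1 minus the fractional identity of the aligned pair). The slides do not state the formula, so treat it as an assumption; the function name is illustrative.

```python
from math import log


def kimura_distance(identity):
    """Kimura-corrected distance from the fractional identity of an aligned pair.

    Assumes the approximation d = -ln(1 - D - D**2 / 5) with D = 1 - identity.
    The formula is undefined for very divergent pairs, so an error is raised
    when the argument of the logarithm is not positive.
    """
    differing = 1.0 - identity
    argument = 1.0 - differing - differing * differing / 5.0
    if argument <= 0.0:
        raise ValueError("sequences too divergent for this approximation")
    return -log(argument)


if __name__ == "__main__":
    for identity in (1.0, 0.8, 0.5):
        print(f"identity={identity:.1f} -> distance={kimura_distance(identity):.3f}")
```

The corrected distance grows faster than the raw fraction of differences because it tries to account for multiple substitutions at the same site.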


MUSCLE. Stage 2. Improved Progressive

2.2. Tree construction using UPGMA

A tree is constructed by computing a Kimura distance matrix and applying a clustering method to it.

[Figure: Kimura distance matrix for sequences X1 to X4 and the tree built from it]


MUSCLE. Stage 2. Improved Progressive

2.3. Progressive alignment

A new progressive alignment is built.

[Figure: the new tree over sequences X1 to X4 and the new alignment built by following it]


MUSCLE. Stage 2. Improved Progressive

2.4. Tree comparison

The new tree is compared to the previous tree by identifying the set of internal nodes for which the branching order has changed.

If Stage 2 has executed more than once, and the number of changed nodes has not decreased, the process of improving the tree is considered to have converged and iteration terminates.

MUSCLE. Stage 3. Refinement.

Refinement is performed iteratively.

3.1. Delete edge from the tree (choice of bipartition)

An edge is removed from the tree, dividing the sequences into two disjoint subsets.

[Figure: a tree over sequences X1 to X5, split into two subtrees by deleting one edge]


MUSCLE. Stage 3. Refinement.

3.2. Compute subtree profiles.

The multiple alignment of each subset is extracted from the current multiple alignment. Columns made up of indels only are removed.

Current multiple alignment (sequences X1 to X5):

TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC

Subset 1, as extracted and after removing indel-only columns:

TCC--AA      TCCAA
TCA--AA      TCAAA

Subset 2:

TCA--GA
T--CTGC
G--ATAC
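Extracting a subset and dropping its indel-only columns is a simple column filter. The sketch below uses the five aligned sequences from the slide; because the mapping between the labels X1 to X5 and the rows is not recoverable from the figure, the names s1 to s5 are placeholders.

```python
def extract_subalignment(alignment, names):
    """Extract the rows in `names` and drop columns that contain only gaps."""
    rows = [alignment[name] for name in names]
    kept = [col for col in zip(*rows) if any(ch != "-" for ch in col)]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


# Rows of the current multiple alignment; s1..s5 are placeholder labels.
alignment = {
    "s1": "TCC--AA",
    "s2": "TCA--GA",
    "s3": "TCA--AA",
    "s4": "G--ATAC",
    "s5": "T--CTGC",
}

subset_1 = extract_subalignment(alignment, ["s1", "s3"])
subset_2 = extract_subalignment(alignment, ["s2", "s4", "s5"])

if __name__ == "__main__":
    print(subset_1)
    print(subset_2)
```

The first subset loses its two all-gap columns, while the second keeps all seven columns because each of them contains at least one residue.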


MUSCLE. Stage 3. Refinement.

3.3. Re-align profiles.

The two profiles are then realigned with each other using profile-profile alignment.

Profiles before realignment:

TCCAA
TCAAA

TCA--GA
T--CTGC
G--ATAC

New alignment:

T--CCAA
T--CAAA
TCA--GA
T--CTGC
G--ATAC


MUSCLE. Stage 3. Refinement.

3.4. Accept/Reject.

The score of the new alignment is computed. If it is higher than the score of the old alignment, the new alignment is retained; otherwise it is discarded.

New alignment:

T--CCAA
T--CAAA
TCA--GA
T--CTGC
G--ATAC

OR

Old alignment:

TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC

Score of alignment

Columns: 1234

ACGT
ACGA
AGGA

match = 1, mismatch = 0

1: A-A + A-A + A-A = 1+1+1 = 3

2: C-C + C-G + C-G = 1+0+0 = 1

3: G-G + G-G + G-G = 1+1+1 = 3

4: T-A + T-A + A-A = 0+0+1 = 1

S(alignment) = S(1) + S(2) + S(3) + S(4) = 3+1+3+1 = 8

The higher the score, the better the alignment.
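The worked example is a sum-of-pairs score with match = 1 and mismatch = 0. A minimal sketch follows; it treats every character alike, so gap handling (which a real scoring function needs) is deliberately left out, and the function name is illustrative.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=0):
    """Sum the pairwise scores of all residue pairs in every column."""
    score = 0
    for col in zip(*alignment):
        for a, b in combinations(col, 2):
            score += match if a == b else mismatch
    return score


example = ["ACGT", "ACGA", "AGGA"]

if __name__ == "__main__":
    print(sum_of_pairs(example))
```

In the accept/reject step, the same function would be applied to the old and the new alignment and the higher-scoring one kept.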

MUSCLE: Multiple Sequence Alignment with reduced time and space complexity

[Figure: comparison of MUSCLE and T-Coffee]


An incorrect conclusion may come from a sequence alignment using incorrect assumptions

Identification of TRAP orthologs as an example of the risk of common mistakes in the analysis

Suppose you want to align a set of MtrB sequences retrieved by gene name from NCBI.

Computational Genomics group

Computational Biology

TRAP sequences (TRyptophan Attenuation Protein)

In Bacillus subtilis the trp operon is also regulated by transcription attenuation.

Instead of the ribosome, attenuation is mediated by an RNA-binding protein called TRAP (trp RNA-Binding Attenuation Protein).

TRAP is formed of 11 identical subunits.

Angela Valbuzzi and Charles Yanofsky. SCIENCE VOL 293 14 SEPTEMBER 2001


MtrB [Desulfobacterium autotrophicum HRM2]: signal transduction histidine kinase, nitrate/nitrite-specific

MtrB [Bacillus amyloliquefaciens FZB42]: tryptophan RNA-binding attenuator protein

An incorrect conclusion may come from a sequence alignment using incorrect assumptions.

Never forget that an MSA is just a model built from the set of sequences given by the user.


Multiple sequence alignment

Exercise:

Use multiple sequence alignment to analyze how our model protein, antiTRAP, aligns with its likely distant homologs.

[Figure: sequence search based on the antiTRAP protein sequence]


>gi|16077322|ref|NP_388135.1| hypothetical protein BSU02530 [Bacillus subtilis subsp. subtilis str. 168]

MVIATDDLEVACPKCERAGEIEGTPCPACSGKGVILTAQGYTLLDFIQKHLNK

>gi|52078761|ref|YP_077552.1| inhibitor of TRAP, regulated by T-box (trp) sequence RtpA [Bacillus licheniformis ATCC 14580]

MVIATDDLETTCPNCNGSGREEPEPCPKCSGKGVILTAQGSTLLHFIKKHLNE

>gi|154684753|ref|YP_001419914.1| RtpA [Bacillus amyloliquefaciens FZB42]

MTGDGQTIKKGGIFMVIATDDLELTCPHCEGTGEEKEGTPCPKCGAKGVILTAQGNTLLHFIRKHIDQ

>gi|116749904|ref|YP_846591.1| hypothetical protein Sfum_2476 [Syntrophobacter fumaroxidans MPOB]

MVRMRLPELETKCWMCWGSGKIASEDHGGGMECPECGGVGWLPTADGRRLLDFVQRHLGIVEEGEDNETL

>gi|221194637|ref|ZP_03567694.1| chaperone protein DnaJ [Atopobium rimae ATCC 49626]

MASMNEKDYYVILEVSETATTEEIRKAFQVKARKLHPDVNKAPDAEARFKEVSEAYAVLSDEGKRRRYDA

MRSGNPFAGGYGPSGSPAGSNSYGQDPFGWGFPFGGVDFSSWRSQGSRRSRAYKPQTGADIEYDLTLTPM

QAQEGVRKGITYQRFSACEACHGSGSVHHSEASSTCPTCGGTGHIHVDLSGIFGFGTVEMECPECEGTGH

VVADPCEACGGSGRVLSASEAVVNVPPHAHDGDEIRMEGKGNAGTNGSKTGDFVVRVRVPEEQVTLRQSM

GARAIGIALPFFAVDLATGASLLGTIIVAMLVVFGVRNIVGDGIKRSQRWWRNLGYAVVNGALTGIAWAL

VAYMFFSCTAGLGRW

>gi|224372791|ref|YP_002607163.1| chaperone protein DnaJ [Nautilia profundicola AmH]

MDYYEILGVERTATKVEIKKAYRKLAMKYHPDKNPGDKEAEEMFKKINEAYQVLSDDEKRAIYDKYGKEG

LEGQGFKTDFDFGDIFDMFNDIFGGGFGGGRAEVQMPYDIDKAIEVTLEFEEAVYGVSKEIEINYFKLCP

KCKGSGAEEKETCPSCHGRGTIIMGNGFMRISQTCPQCSGRGFIAKKVCNECRGKGYIVESETVKVDIPA

GIDTGMRMRVKGRGNQDISGYRGDLYLIFNVKESKIFKRKGNNLIVEVPIFFTSAILGDTVKIPTLSGEK

EIEIKPHTKDNTKIVFRGEGIADPNTGYRGDLIAILKIVYPKKLTDEQRELLEKLHKSFGGEIKEHKSIL

EEAIDKVKSWFKGS

>gi|57867036|ref|YP_188723.1| dnaJ protein [Staphylococcus epidermidis RP62A]

MAKRDYYEVLGVNKSASKDEIKKAYRKLSKKYHPDINKEEGADEKFKEISEAYEVLSDENKRVNYDQFGH

DGPQGGFGSQGFGGSDFGGFEDIFSSFFGGGSRQRDPNAPRKGDDLQYTMTITFEEAVFGTKKEISIKKD

VTCHTCNGDGAKPGTSKKTCSYCNGAGRVSVEQNTILGRVRTEQVCPKCEGSGQEFEEPCPTCKGKGTEN

KTVKLEVTVPEGVDNEQQVRLAGEGSPGVNGGPHGDLYVVFRVKPSNTFERDGDDIYYNLDISFSQAALG

DEIKIPTLKSNVVLTIPAGTQTGKQFRLKDKGVKNVHGYGYGDLFVNIKVVTPTKLNDRQKELLKEFAEI

NGENINEQSSNFKDRAKRFFKGE


UPGMA (Unweighted Pair Group Method with Arithmetic mean)

One of the fastest tree construction methods.

Used in Pileup (GCG package).

Clustal uses neighbor joining, but calculating an NJ tree is much more demanding; thus, UPGMA is demonstrated here.

[Figure: UPGMA tree]


Constructing MSA

human     ACGTACGTCC
chimp     ACCTACGTCC
gorilla   ACCACCGTCC
orangutan ACCCCCCTCC
macaque   CCCCCCCCCC

[Figure: further panels repeat the human, chimp, gorilla and orangutan sequences]

MUSCLE: Multiple Sequence Alignment with reduced time and space complexity

MUSCLE. Stage 2: Improved Progressive

Similarity Measure

Similarity is calculated for each pair of sequences using fractional identity computed from their mutual alignment in the current multiple alignment.

Current multiple alignment:

TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC

Pair extracted from it:

TCC--AA
TCA--AA