Multiple sequence alignments. Taller Latinoamericano de Evolución Molecular (Latin American Workshop on Molecular Evolution), CCG-UNAM, Mexico. January 17-28, 2011
Enrique Merino, IBT-UNAM
Multiple sequence alignments
Special thanks to all the scientists who made their presentations publicly available on the web; many slides in this presentation were taken from them.
Web sites used in our practice
Sequence Retrieval System
RSA Tools
BLAST
ClustalW
Figures are linked to their corresponding web sites
Structural criteria: residues are arranged so that those playing a similar role end up in the same column.
Evolutionary criteria: residues are arranged so that those having the same ancestor end up in the same column.
Similarity criteria: as many similar residues as possible are placed in the same column.
What is a Multiple Sequence Alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
[Figure: axes x, y, z]
Multiple sequence alignments
Seems a simple extension: align k sequences at the same time.
Unfortunately, this can get very expensive. For more than eight proteins of average length, the problem is computationally intractable given current computer power. Therefore, all of the methods capable of handling larger problems in practical timescales make use of “heuristics”.
Aligning N sequences of length L requires a matrix of size L^N, where each cell in the matrix has 2^N - 1 neighbors.
This gives a total time complexity of O(2^N L^N).
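To get a feeling for how quickly these numbers grow, the following minimal sketch evaluates L^N and 2^N - 1 for a few values of N. The function name `msa_dp_cost` and the example length of 300 residues are illustrative assumptions, not part of the original slides.

```python
def msa_dp_cost(num_sequences: int, length: int) -> dict:
    """Size of the exact dynamic-programming problem for N sequences of length L."""
    cells = length ** num_sequences        # L^N cells in the N-dimensional matrix
    neighbors = 2 ** num_sequences - 1     # neighbors examined for each cell
    return {
        "cells": cells,
        "neighbors": neighbors,
        "operations": cells * neighbors,   # grows as O(2^N L^N)
    }


if __name__ == "__main__":
    # Hypothetical example: proteins of about 300 residues.
    for n in (2, 3, 8):
        cost = msa_dp_cost(n, 300)
        print(n, cost["cells"], cost["neighbors"], cost["operations"])
```

Going from two to eight sequences multiplies the number of cells by 300^6, which is why exact methods are only practical for a handful of sequences.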
Multiple sequence alignments
The MSA contains what you put inside…
You can view your MSA as:
A record of evolution
A summary of a protein family
A collection of experiments made for you by Nature…
a MSA is a MODEL
What is a Multiple Sequence Alignment?
2,000,000,000 years
What Is A Multiple Sequence Alignment?
It Indicates the RELATIONSHIP between residues
of different sequences.
It REVEALS
-Similarities
-Inconsistencies
Multiple Alignments are
CENTRAL to MOST
Bioinformatics Techniques.
Why Is It Difficult To Compute A multiple Sequence Alignment?
A CROSSROAD PROBLEM
BIOLOGY: What is A Good Alignment?
COMPUTATION: What is THE Good Alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
Why Is It Difficult To Compute A multiple Sequence Alignment ?
BIOLOGY
CIRCULAR PROBLEM....
Good Sequences
Good Alignment
COMPUTATION
Multiple Alignments:
What Are They Good For?
Profile (gapped weight matrix)
RWDAGCVN
RWDSGCVN
RWHHGCVQ
RWKGACYN
RWLWACEQ
R-W-x(2)-[AG]-C-x-[NQ]
Regular expression (pattern)
Position-specific weight matrices (blocks)
Hidden Markov model (HMM)
Frequency (identity) matrices
fingerprint
motif
Multiple Sequence Alignment Derived Information
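As an illustration of the "regular expression (pattern)" item, the sketch below translates the pattern R-W-x(2)-[AG]-C-x-[NQ] into a Python regular expression and checks it against the five aligned sequences shown on the slide. The helper `prosite_to_regex` is a hypothetical name and only handles the small subset of pattern syntax that appears here (single residues, `x`, bracketed alternatives and `(n)` repeat counts).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Separate an optional repeat count such as "(2)" from the residue part.
        match = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        residue, count = match.group(1), match.group(2)
        if residue == "x":
            residue = "."  # "x" stands for any residue
        parts.append(residue + (f"{{{count}}}" if count else ""))
    return "".join(parts)


SEQUENCES = ["RWDAGCVN", "RWDSGCVN", "RWHHGCVQ", "RWKGACYN", "RWLWACEQ"]
REGEX = prosite_to_regex("R-W-x(2)-[AG]-C-x-[NQ]")

if __name__ == "__main__":
    print(REGEX)
    for seq in SEQUENCES:
        print(seq, bool(re.fullmatch(REGEX, seq)))
```

A pattern keeps only which residues are allowed at each position; it discards how often each one occurs, which is the information a profile or HMM retains.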
Simultaneous as opposed to Progressive
Exact as opposed to Heuristic
Stochastic as opposed to Deterministic
Iterative as opposed to Non-iterative
[Simultaneous: they use all the information at the same time]
[Heuristics: cut corners, like BLAST vs. SW (Smith-Waterman)]
[Heuristics: do not guarantee an optimal solution]
[Stochastic: contain an element of randomness]
[Stochastic: example of a Monte Carlo surface estimation]
[Iterative: most stochastic methods are iterative]
[Iterative: run the same algorithm many times]
Multiple Sequence Alignment classification
“The Correct Alignment”
                     Correct according to     Correct according to
                     optimality criteria      homology
Exhaustive methods   Always                   Not always
Heuristic methods    Not always               Not always
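[Figure: classification of multiple alignment programs. Category labels: Iterative, Progressive, Simultaneous, Non tree based (GAs, HMMs). Programs shown: Iteralign, Prrp, SAM, HMMer, SAGA, GA, Clustal, Dialign, T-Coffee, MSA, POA, OMA, Praline, MAFFT, DCA, Combalign]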
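[Figure: classification of multiple alignment programs. Category labels: Iterative, Progressive, Simultaneous, Stochastic. Programs shown: Iteralign, Prrp, SAM, HMMer, GA, SAGA, Clustal, Dialign, T-Coffee, MSA, POA, OMA, Praline, MAFFT, DCA, Combalign]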
In any case, MSA methods consider the evolution of each column as an independent process.
How close to reality is this assumption?
3D protein models can be evaluated based on
the co-evolution of their interacting residues
[Figure: pairs of positions in proteins A and B]
The presence of 'correlated positions' between pairs
of positions in pairs of multiple sequence alignments
can be used in predicting intra-protein and protein-
protein interactions.
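One common way to quantify such correlated positions is the mutual information between two alignment columns. The slides do not name a specific measure, so the sketch below is only an illustration under that assumption; the columns are toy data, with one character per organism in the same row order.

```python
from collections import Counter
from math import log2


def mutual_information(col_a: str, col_b: str) -> float:
    """Mutual information (in bits) between two alignment columns of equal depth."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must come from the same set of organisms")
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi


if __name__ == "__main__":
    # Toy columns: the second changes whenever the first does, the third does not.
    print(mutual_information("AAGG", "TTCC"))
    print(mutual_information("AAGG", "TCTC"))
```

Columns that change together score high, while columns that vary independently score near zero. With real alignments the raw value is also inflated by shared phylogeny and small sample size, which this sketch does not correct for.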
Julie D. Thompson, Desmond G. Higgins and Toby J. Gibson. Nucleic Acids Research, 1994, Vol. 22, No. 22, 4673-4680
SeqA Name Len(aa) SeqB Name Len(aa) Identity
1 human 60 2 dog 60 77%
1 human 60 3 mouse 59 61%
2 dog 60 3 mouse 59 52%
human EYSGSSEKIDLLASDPHEALICKSERVHSKSVESNIEDKIFGKTYRKKASLPNLSHVTEN 480
Dog EYSGSSEKIDLMASDPQDAFICESERVHTKPVGGNIEDKIFGKTYRRKASLPKVSHTTEV 477
mouse GGFSSSRKTDLVTPDPHHTLMCKSGRDFSKPVEDNISDKIFGKSYQRKGSRPHLNHVTE 476
Step 1 – Pairwise Alignment. Compare each sequence with every other sequence and calculate a distance matrix
Multiple sequence alignments. Clustal W
Distance = number of exact matches divided by the sequence length (ignoring gaps). Despite the name, this value is a fractional identity: the higher the number, the more closely related the two sequences are.
In this distance matrix, the sequence of Human is 76% identical to the sequence of Dog (the table above reports the same pair as 77%).
     H      D      M
H    -
D    0.76   -
M    0.61   0.52   -
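The identity value for one pair can be checked directly. The sketch below counts exact matches between the human and dog fragments shown earlier, which have the same length and need no gaps. The function name `ungapped_identity` is an illustrative assumption; Clustal W itself derives these values from pairwise alignments, which this sketch does not perform.

```python
def ungapped_identity(seq_a: str, seq_b: str) -> tuple:
    """Return (number of identical positions, fraction identical) for equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only handles sequences of equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches, matches / len(seq_a)


# Fragments copied from the slide (60 residues each).
HUMAN = "EYSGSSEKIDLLASDPHEALICKSERVHSKSVESNIEDKIFGKTYRKKASLPNLSHVTEN"
DOG = "EYSGSSEKIDLMASDPQDAFICESERVHTKPVGGNIEDKIFGKTYRRKASLPKVSHTTEV"

if __name__ == "__main__":
    matches, identity = ungapped_identity(HUMAN, DOG)
    print(matches, len(HUMAN), round(identity, 4))
```

The mouse fragment is one residue shorter, so comparing it requires a real pairwise alignment with a gap rather than this position-by-position count.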
Multiple sequence alignments. Clustal W
Different
sequences
Step 2 – Create Guide Tree. Use the results of the distance
matrix to create a Guide Tree to help determine in what order the
sequences will be aligned.
The Guide Tree, or dendrogram, has no phylogenetic meaning.
It cannot be used to show evolutionary relationships.
[Figure: the distance matrix (H-D 0.76, H-M 0.61, D-M 0.52) and the guide tree derived from it, with leaves H, D and M]
Guide Tree
Multiple sequence alignments. Clustal W
Initially the guide Trees were calculated using the UPGMA method.
The current version uses the Neighbour-Joining method which gives better
estimates of individual branch lengths.
Step 3 – Progressive Alignment
Follow the Guide Tree and align the sequences
Align the most closely related sequences first, then add in the more
distantly related ones and align them to the existing alignment,
inserting gaps if necessary
1. Align Human and Dog first
2. Add sequence Mouse to the previous alignment of Human and Dog
Multiple sequence alignments. Clustal W
By the time the most distantly related sequences are
aligned, one already has a sample of aligned sequences
which gives important information about the variability at
each position
Multiple sequence alignments. Clustal W
Short stretches of 5 hydrophilic residues often indicate loop or random coil regions (not essential for structure), and therefore gap penalties are reduced for such stretches.
Gap penalties for closely related sequences are lowered compared to more distantly related sequences (“once a gap always a gap” rule). It is thought that those gaps occur in regions that do not disrupt the structure or function.
Alignments of proteins of known structure show that gaps do not occur more frequently than every eight residues. Therefore, penalties are increased for gaps required within 8 residues of an existing gap. This gives a lower alignment score in that region.
A gap weight is assigned after each amino acid according to the frequency with which such a gap occurs after that amino acid in nature.
Multiple sequence alignments. Clustal W. Gap treatment
As we know, there are many scoring matrices that
one can use depending on the relatedness of the
aligned proteins.
As the alignment proceeds to longer branches the
aa scoring matrices are changed to accommodate
more divergent sequences. The length of the
branch is used to determine which matrix to use.
Similar sequences with "hard" matrices (BLOSUM80)
Distant sequences with "soft" matrices (BLOSUM50)
Amino acid weight matrices
Multiple sequence alignments. Clustal W
Relative contribution of each pairwise alignment to the
global alignment score
Sequences are weighted to compensate for bias of redundant elements in the alignment
Multiple sequence alignments. Clustal W
Pairwise Alignment: Calculation of distance matrix
Creation of unrooted Neighbor-Joining Tree
Rooted NJ Tree (guide tree) and calculation of sequence weights
Progressive alignment following the Guide Tree
Flowchart of computation steps in Clustal. http://www.clustal.org/
ClustalW: http://www.ch.embnet.org/software/ClustalW.html
Multiple sequence alignments. TCoffee
Regular progressive alignment strategy may produce
alignment errors
Multiple sequence alignments. TCoffee
Local Alignment Global Alignment
Extension
Library Based Multiple Sequence Alignment
Multiple Sequence Alignment
The global alignments are constructed using ClustalW on the sequences, two at a time.
The local alignments are the ten top scoring non-intersecting local alignments between each pair of sequences, gathered using the Lalign program of the FASTA package (Lalign is a variant of the Smith and Waterman method).
T-Coffee: Mixing Local and Global Alignments
T-Coffee: Primary Library
In the library, each alignment is represented as a list of pair-wise residue matches; each of these pairs is a constraint.
Not all of these constraints are equally important.
This is taken into account when computing the multiple alignment, giving priority to the most reliable residue pairs.
T-Coffee: Analysis of Consistency
For each pair of aligned residues in the library, we can assign a weight that reflects the degree to which those residues align consistently with residues from the other sequences.
We enormously increase the value of the information in the library by examining the consistency of each pair of residues with residue pairs from all of the other alignments.
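The consistency idea can be sketched with a toy library. In the simplified version below, the weight of a residue pair is its primary weight plus, for every residue of a third sequence that is aligned to both, the smaller of the two connecting weights. The data structure, the function name `extend_library` and the example weights are assumptions made for illustration; the real program derives its primary weights from the pairwise alignments.

```python
from collections import defaultdict


def extend_library(primary: dict) -> dict:
    """Extend a primary library of residue-pair weights through third sequences.

    Keys are pairs of residues, each residue written as (sequence_name, position).
    """
    weights = defaultdict(float)
    neighbors = defaultdict(set)
    for (a, b), w in primary.items():
        weights[(a, b)] = w
        weights[(b, a)] = w          # store both directions for easy lookup
        neighbors[a].add(b)
        neighbors[b].add(a)

    extended = {}
    residues = list(neighbors)
    for a in residues:
        for b in residues:
            if a[0] >= b[0]:
                continue             # different sequences only, each pair once
            total = weights.get((a, b), 0.0)
            # Add support from every residue c aligned to both a and b.
            for c in neighbors[a] & neighbors[b]:
                if c[0] not in (a[0], b[0]):
                    total += min(weights[(a, c)], weights[(c, b)])
            if total > 0:
                extended[(a, b)] = total
    return extended


# Toy primary library: one residue from each of three sequences A, B and C.
PRIMARY = {
    (("A", 1), ("B", 1)): 88,
    (("A", 1), ("C", 1)): 77,
    (("C", 1), ("B", 1)): 100,
}
EXTENDED = extend_library(PRIMARY)

if __name__ == "__main__":
    for pair, weight in sorted(EXTENDED.items()):
        print(pair, weight)
```

A pair that is also supported through a third sequence ends up with a higher weight than its direct alignment alone would give, so the later dynamic programming step prefers it.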
The Triplet Assumption
[Figure: residues X, Y and Z; SEQ A and SEQ B]
[Figure labels: Consistency, Consensus, ClustalW, T-Coffee]
Dynamic Programming Using An Extended Library
Progressive Alignment
T-Coffee: Aligned sequences using the Extended Library
Each Library Line is a Soft Constraint (a wish)
You can’t satisfy them all
You must satisfy as many as possible (the easy ones)
T-Coffee and Consistency…
Local Alignment Global Alignment
Multiple Sequence Alignment
Multiple Alignment
Structural Specialist
Mixing Heterogeneous Data With T-Coffee
• The method is broadly based on the popular progressive approach to multiple alignment but avoids the most serious pitfalls caused by the greedy nature of this algorithm.
• With T-Coffee we pre-process a data set of all pair-wise alignments between the sequences.
• This provides us with a library of alignment information that can be used to guide the progressive alignment.
• Intermediate alignments are then based not only on the sequences to be aligned next but also on how all of the sequences align with each other.
• This alignment information can be derived from heterogeneous sources such as a mixture of alignment programs and/or structure superposition.
T-Coffee and Consistency… (Summary)
Why Do We Want To Mix Sequences and Structures?
•Sequences are Cheap and Common.
•Structures are Expensive and Rare.
3D-Coffee
Cheapest Structure determination:
Sequence-Structure Alignment
THREAD or ALIGN
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
Why Do We Want To Mix Sequences and Structures?
3D-Coffee
Convincing Alignment
Same Fold
Distant sequences are
hard to align
Why Do We Want To Mix Sequences and Structures?
3D-Coffee
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
Multiple Sequence Alignments
Help
Exploring the Twilight Zone
Why Do We Want To Mix Sequences and Structures?
3D-Coffee
Structure
Superposition
Why Do We Want To Mix Sequences and Structures?
3D-Coffee
-Structures Help BUT NOT SO MUCH
Conclusion
Why Do We Want To Mix Sequences and Structures?
3D-Coffee http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html
http://www.tcoffee.org/
MUSCLE: Multiple Sequence Alignment with reduced time and space complexity
Basic Strategy: A progressive alignment is built, to
which horizontal refinement is applied
Three stages
At end of each stage, a multiple alignment is
available and the algorithm can be terminated
MUSCLE: Multiple Sequence Alignment with reduced time and space complexity
Stages: Draft Progressive, Improved Progressive, Refinement
MUSCLE. Stage 1. Draft Progressive
1.1. Similarity measure and Distance estimate
Calculated using k-mer counting.
A k-mer is a contiguous subsequence of length k, also known as a word or k-tuple.
Related sequences tend to have more k-mers in common than expected by chance.
ACCATGCGAATGGTCCACAATG
k-mer   score
ATG     3
CCA     2
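The k-mer scores in this example can be reproduced with a few lines of code. The sketch below counts overlapping k-mers and also shows one simple way to turn shared k-mers into a similarity value. The function `shared_kmer_fraction` is an illustrative assumption and is not claimed to be the exact formula used by MUSCLE.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def shared_kmer_fraction(seq_a: str, seq_b: str, k: int) -> float:
    """Fraction of k-mers shared by two sequences, relative to the shorter one."""
    counts_a, counts_b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    shared = sum((counts_a & counts_b).values())   # multiset intersection
    possible = min(len(seq_a), len(seq_b)) - k + 1
    return shared / possible if possible > 0 else 0.0


SEQUENCE = "ACCATGCGAATGGTCCACAATG"
COUNTS = kmer_counts(SEQUENCE, 3)

if __name__ == "__main__":
    print(COUNTS["ATG"], COUNTS["CCA"])
    print(shared_kmer_fraction(SEQUENCE, SEQUENCE, 3))
```

No alignment is needed to compute these counts, which is what makes this first distance estimate fast.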
The goal of the first stage is to produce a multiple alignment,
emphasizing speed over accuracy
MUSCLE. Stage 1. Draft Progressive
Based on the pairwise similarities, a triangular
distance matrix is computed.
      X1    X2    X3    X4
X1    -
X2    0.6   -
X3    0.8   0.2   -
X4    0.3   0.7   0.5   -
MUSCLE. Stage 1. Draft Progressive
1.2. Tree construction using UPGMA
From the distance matrix we construct a tree using the UPGMA method
      X1    X2    X3    X4
X1    -
X2    0.6   -
X3    0.8   0.2   -
X4    0.3   0.7   0.5   -
[Figure: UPGMA tree in which X3 and X2 form one cluster, X4 and X1 form another, and the two clusters are then joined]
One of the fastest tree construction methods.
It is a simple agglomerative (bottom-up) data clustering method.
UPGMA assumes a constant rate of evolution (molecular clock hypothesis).
At each step, the nearest 2 clusters are combined into a higher-level cluster. The distance between any 2 clusters A and B is taken to be the average of all distances between pairs of objects "a" in A and "b" in B.
(Unweighted Pair Group Method with Arithmetic mean)
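The clustering rule described above can be written as a short sketch. It reads the example distances as d(X1,X2)=0.6, d(X1,X3)=0.8, d(X2,X3)=0.2, d(X1,X4)=0.3, d(X2,X4)=0.7 and d(X3,X4)=0.5, which is an interpretation of the slide figure. The function name `upgma` and the way merges are reported are illustrative choices.

```python
from itertools import combinations


def upgma(names, distances):
    """Return the list of merges (cluster_a, cluster_b, distance) made by UPGMA."""
    clusters = {name: (name,) for name in names}          # label -> leaf members
    dist = {frozenset(pair): d for pair, d in distances.items()}
    merges = []
    while len(clusters) > 1:
        # Pick the two nearest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merges.append((a, b, dist[frozenset((a, b))]))
        merged = f"({a},{b})"
        size_a, size_b = len(clusters[a]), len(clusters[b])
        for other in clusters:
            if other in (a, b):
                continue
            # The average over all leaf pairs equals the size-weighted mean
            # of the distances from the two merged clusters.
            d_a = dist[frozenset((a, other))]
            d_b = dist[frozenset((b, other))]
            dist[frozenset((merged, other))] = (size_a * d_a + size_b * d_b) / (size_a + size_b)
        members = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
        clusters[merged] = members
    return merges


NAMES = ["X1", "X2", "X3", "X4"]
DISTANCES = {
    ("X1", "X2"): 0.6, ("X1", "X3"): 0.8, ("X2", "X3"): 0.2,
    ("X1", "X4"): 0.3, ("X2", "X4"): 0.7, ("X3", "X4"): 0.5,
}
MERGES = upgma(NAMES, DISTANCES)

if __name__ == "__main__":
    for step in MERGES:
        print(step)
```

With these distances the first merge joins X2 and X3, the second joins X1 and X4, and the last joins the two clusters, matching the tree in the slide.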
MUSCLE. Stage 1. Draft Progressive
1.3. Progressive alignment.
A progressive alignment is built by following the branching order of the
tree, yielding a multiple alignment of all input sequences at the root.
The alignment is done by profiles
Profile-profile alignment
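A profile summarizes an alignment column by column. The sketch below builds the simplest kind, the relative frequency of each symbol (including the gap character) in every column. The function name `column_profile` is an assumption, and the scoring functions that MUSCLE uses to align two profiles are not shown.

```python
from collections import Counter


def column_profile(alignment):
    """Return one dict per column mapping each symbol to its relative frequency."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({symbol: n / len(column) for symbol, n in counts.items()})
    return profile


# Two aligned toy sequences.
PROFILE = column_profile(["TCC--AA", "TCA--AA"])

if __name__ == "__main__":
    for index, column in enumerate(PROFILE, start=1):
        print(index, column)
```

In profile-profile alignment, whole columns like these are aligned against the columns of another profile rather than single residues against single residues.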
MUSCLE. Stage 2. Improved Progressive
2.1. Similarity measure and Distance estimate
The main source of error in the draft progressive stage is the approximate k-mer distance measure, which results in a suboptimal tree.
MUSCLE therefore re-estimates the tree using the Kimura distance, which is more accurate but requires an alignment.
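The sketch below shows why an alignment is needed. The fractional identity of two sequences is read from their rows in the current alignment and then converted with Kimura's correction for protein distances, d = -ln(1 - D - D^2/5), where D is the fraction of differing positions. Counting only columns where neither row has a gap is an assumption of this sketch, and the details of the actual MUSCLE implementation may differ.

```python
from math import log


def fractional_identity(row_a: str, row_b: str) -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def kimura_distance(identity: float) -> float:
    """Kimura-corrected distance computed from a fractional identity."""
    d = 1.0 - identity                 # fraction of differing positions
    inner = 1.0 - d - d * d / 5.0
    if inner <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -log(inner)


if __name__ == "__main__":
    identity = fractional_identity("TCC--AA", "TCA--GA")
    print(identity, kimura_distance(identity))
```

The correction grows faster than the raw difference, reflecting multiple substitutions at the same site, and it is undefined for very divergent pairs.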
MUSCLE. Stage 2. Improved Progressive
2.2. Tree construction using UPGMA
A tree is constructed by computing a Kimura distance
matrix and applying a clustering method to it
[Figure: Kimura distance matrix for X1 to X4 and the tree clustered from it]
MUSCLE. Stage 2. Improved Progressive
2.3. Progressive alignment
A new progressive alignment is built
[Figure: re-estimated tree over X1 to X4 and the new alignment built by following it]
The new tree is compared to the previous tree by identifying the set of
internal nodes for which the branching order has changed.
If Stage 2 has executed more than once, and the number of changed
nodes has not decreased, the process of improving the tree is considered
to have converged and iteration terminates.
MUSCLE. Stage 2. Improved Progressive
2.4. Tree comparison
Refinement is performed iteratively
MUSCLE. Stage 3. Refinement.
MUSCLE. Stage 3. Refinement.
3.1. Delete edge from the Tree.
Choice of bipartition
An edge is removed from the tree, dividing the sequences
into two disjoint subsets
[Figure: tree with leaves X1 to X5, before and after one edge is deleted]
MUSCLE. Stage 3. Refinement.
3.2. Compute subtree profiles.
The multiple alignment of each subset is extracted from
current multiple alignment. Columns made up of indels
only are removed
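Current alignment (sequences X1 to X5 in the figure):
TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC
Subset 1, as extracted:
TCC--AA
TCA--AA
Subset 2, as extracted:
TCA--GA
T--CTGC
G--ATAC
Subset 1 after removing indel-only columns:
TCCAA
TCAAA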
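The extraction step can be sketched as follows. The labels `s1` to `s5` are hypothetical names given to the five rows in the order they appear in the example, since the mapping to X1 to X5 is not recoverable from the figure.

```python
def extract_subalignment(alignment: dict, names: list) -> dict:
    """Pull a subset of rows out of an alignment and drop columns that are all gaps."""
    rows = [alignment[name] for name in names]
    keep = [
        index
        for index, column in enumerate(zip(*rows))
        if any(symbol != "-" for symbol in column)
    ]
    return {
        name: "".join(row[index] for index in keep)
        for name, row in zip(names, rows)
    }


# Rows of the example alignment, with hypothetical labels.
ALIGNMENT = {
    "s1": "TCC--AA",
    "s2": "TCA--GA",
    "s3": "TCA--AA",
    "s4": "G--ATAC",
    "s5": "T--CTGC",
}

if __name__ == "__main__":
    print(extract_subalignment(ALIGNMENT, ["s1", "s3"]))
    print(extract_subalignment(ALIGNMENT, ["s2", "s4", "s5"]))
```

Only the first subset shrinks, because its two rows share the same gap columns; the second subset has a residue in every column and is left unchanged.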
MUSCLE. Stage 3. Refinement.
3.3. Re-align profiles.
The two profiles are then realigned with each other using
profile-profile alignment.
Profiles before realignment:
TCCAA
TCAAA
and
TCA--GA
T--CTGC
G--ATAC
After profile-profile realignment:
T--CCAA
T--CAAA
TCA--GA
T--CTGC
G--ATAC
MUSCLE. Stage 3. Refinement.
3.4. Accept/Reject.
The score of the new alignment is computed. If it is higher than the score of the old alignment, the new alignment is retained; otherwise it is discarded.
New:
T--CCAA
T--CAAA
TCA--GA
T--CTGC
G--ATAC
OR
Old:
TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC
Scoring: match = 1, mismatch = 0
Column  1234
        ACGT
        ACGA
        AGGA
1: A-A + A-A + A-A = 1+1+1 = 3
2: C-C + C-G + C-G = 1+0+0 = 1
3: G-G + G-G + G-G = 1+1+1 = 3
4: T-A + T-A + A-A = 0+0+1 = 1
S(alignment) = S(1) + S(2) + S(3) + S(4) = 3+1+3+1 = 8
The higher the score, the better the alignment
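The column-by-column calculation above is a sum-of-pairs score and can be reproduced with the sketch below. It uses the same toy scoring (match = 1, mismatch = 0). The example contains no gaps, so gap handling is deliberately left out; as written, two gap characters in the same column would count as a match, which a real scoring scheme would treat differently.

```python
from itertools import combinations


def column_scores(alignment, match=1, mismatch=0):
    """Sum-of-pairs score of every column of an alignment."""
    scores = []
    for column in zip(*alignment):
        scores.append(
            sum(match if a == b else mismatch for a, b in combinations(column, 2))
        )
    return scores


def sum_of_pairs(alignment, match=1, mismatch=0):
    """Total sum-of-pairs score of an alignment."""
    return sum(column_scores(alignment, match, mismatch))


EXAMPLE = ["ACGT", "ACGA", "AGGA"]

if __name__ == "__main__":
    print(column_scores(EXAMPLE))
    print(sum_of_pairs(EXAMPLE))
```

In the refinement stage, a score of this kind is computed for both the old and the realigned version, and the higher-scoring alignment is kept.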
MUSCLE. Stage 3. Refinement.
3.4. Accept/Reject.
Score of alignment
MUSCLE: Multiple Sequence Alignment with reduced time and space complexity
MUSCLE
T-coffee
An incorrect conclusion may come from a sequence alignment using incorrect assumptions
Suppose you want to align a set of MtrB sequences retrieved by gene name from NCBI
Identification of TRAP orthologs as an example of the risk of common mistakes in the analysis
Computational Genomic group
Angela Valbuzzi and Charles Yanofsky. SCIENCE VOL 293 14 SEPTEMBER 2001
Computational Biology
TRAP is formed of 11 identical subunits
Instead of the ribosome, attenuation is mediated by an RNA-binding protein called TRAP (trp RNA-Binding Attenuation Protein)
Computational Biology
TRAP sequences (TRyptophan Attenuation Protein)
In Bacillus subtilis the trp operon is also regulated by transcription attenuation
MtrB [Desulfobacterium autotrophicum HRM2]
Signal transduction histidine kinase, nitrate/nitrite-specific
MtrB [Bacillus amyloliquefaciens FZB42]
Tryptophan RNA-binding attenuator protein
An incorrect conclusion may come from a sequence
alignment using incorrect assumptions
Never forget that MSA is just a model that
performs on a set of sequences given by the user
Multiple sequence alignment
Exercise:
Use multiple sequence alignment to analyze how our model protein, antiTRAP, aligns with its likely distant homologs.
Sequence search based on the antiTRAP protein sequence
>gi|16077322|ref|NP_388135.1| hypothetical protein BSU02530 [Bacillus subtilis subsp.
subtilis str. 168]
MVIATDDLEVACPKCERAGEIEGTPCPACSGKGVILTAQGYTLLDFIQKHLNK
>gi|52078761|ref|YP_077552.1| inhibitor of TRAP, regulated by T-box (trp) sequence RtpA
[Bacillus licheniformis ATCC 14580]
MVIATDDLETTCPNCNGSGREEPEPCPKCSGKGVILTAQGSTLLHFIKKHLNE
>gi|154684753|ref|YP_001419914.1| RtpA [Bacillus amyloliquefaciens FZB42]
MTGDGQTIKKGGIFMVIATDDLELTCPHCEGTGEEKEGTPCPKCGAKGVILTAQGNTLLHFIRKHIDQ
>gi|116749904|ref|YP_846591.1| hypothetical protein Sfum_2476 [Syntrophobacter
fumaroxidans MPOB]
MVRMRLPELETKCWMCWGSGKIASEDHGGGMECPECGGVGWLPTADGRRLLDFVQRHLGIVEEGEDNETL
>gi|221194637|ref|ZP_03567694.1| chaperone protein DnaJ [Atopobium rimae ATCC 49626]
MASMNEKDYYVILEVSETATTEEIRKAFQVKARKLHPDVNKAPDAEARFKEVSEAYAVLSDEGKRRRYDA
MRSGNPFAGGYGPSGSPAGSNSYGQDPFGWGFPFGGVDFSSWRSQGSRRSRAYKPQTGADIEYDLTLTPM
QAQEGVRKGITYQRFSACEACHGSGSVHHSEASSTCPTCGGTGHIHVDLSGIFGFGTVEMECPECEGTGH
VVADPCEACGGSGRVLSASEAVVNVPPHAHDGDEIRMEGKGNAGTNGSKTGDFVVRVRVPEEQVTLRQSM
GARAIGIALPFFAVDLATGASLLGTIIVAMLVVFGVRNIVGDGIKRSQRWWRNLGYAVVNGALTGIAWAL
VAYMFFSCTAGLGRW
>gi|224372791|ref|YP_002607163.1| chaperone protein DnaJ [Nautilia profundicola AmH]
MDYYEILGVERTATKVEIKKAYRKLAMKYHPDKNPGDKEAEEMFKKINEAYQVLSDDEKRAIYDKYGKEG
LEGQGFKTDFDFGDIFDMFNDIFGGGFGGGRAEVQMPYDIDKAIEVTLEFEEAVYGVSKEIEINYFKLCP
KCKGSGAEEKETCPSCHGRGTIIMGNGFMRISQTCPQCSGRGFIAKKVCNECRGKGYIVESETVKVDIPA
GIDTGMRMRVKGRGNQDISGYRGDLYLIFNVKESKIFKRKGNNLIVEVPIFFTSAILGDTVKIPTLSGEK
EIEIKPHTKDNTKIVFRGEGIADPNTGYRGDLIAILKIVYPKKLTDEQRELLEKLHKSFGGEIKEHKSIL
EEAIDKVKSWFKGS
>gi|57867036|ref|YP_188723.1| dnaJ protein [Staphylococcus epidermidis RP62A]
MAKRDYYEVLGVNKSASKDEIKKAYRKLSKKYHPDINKEEGADEKFKEISEAYEVLSDENKRVNYDQFGH
DGPQGGFGSQGFGGSDFGGFEDIFSSFFGGGSRQRDPNAPRKGDDLQYTMTITFEEAVFGTKKEISIKKD
VTCHTCNGDGAKPGTSKKTCSYCNGAGRVSVEQNTILGRVRTEQVCPKCEGSGQEFEEPCPTCKGKGTEN
KTVKLEVTVPEGVDNEQQVRLAGEGSPGVNGGPHGDLYVVFRVKPSNTFERDGDDIYYNLDISFSQAALG
DEIKIPTLKSNVVLTIPAGTQTGKQFRLKDKGVKNVHGYGYGDLFVNIKVVTPTKLNDRQKELLKEFAEI
NGENINEQSSNFKDRAKRFFKGE
UPGMA
Unweighted Pair Group Method with Arithmetic mean
One of the fastest tree construction methods
Used in Pileup (GCG package)
Clustal uses neighbor joining, but calculating an NJ tree is much more demanding; thus, UPGMA is demonstrated here
UPGMA tree
Constructing MSA
human ACGTACGTCC
chimp ACCTACGTCC
gorilla ACCACCGTCC
orangutan ACCCCCCTCC
macaque CCCCCCCCCC
human ACGTACGTCC
chimp ACCTACGTCC
gorilla ACCACCGTCC
orangutan ACCCCCCTCC
human ACGTACGTCC
chimp ACCTACGTCC
gorilla ACCACCGTCC
orangutan ACCCCCCTCC
Similarity Measure
Similarity is calculated for each pair of sequences using
fractional identity computed from their mutual alignment
in the current multiple alignment
TCC--AA
TCA--GA
TCA--AA
G--ATAC
T--CTGC
Mutual alignment of one pair:
TCC--AA
TCA--AA
MUSCLE.
Stage 2: Improved Progressive
MUSCLE: Multiple Sequence Alignment with reduced time and space complexity