multiple sequence alignment — the basics introduction …stevet/introbioinfo/msa.pdfmultiple...

8
Steve Thompson 2/11/09 1 Feb. 9, 2009 BSC4933(04)/ISC5224(01): Introduction to Bioinformatics Florida State University School of Computational Science and Department of Biological Science More data yields stronger analyses — if done carefully! The patterns of conservation become ever clearer by comparing the conserved portions of sequences amongst a larger and larger dataset. Mosaic ideas and evolutionary ‘importance.’ Multiple Sequence Alignment — the basics Steven M. Thompson Florida State University Department of Scientific Computing Molecular evolutionary analysis; plus Probe/primer, and motif/profile design; Graphical illustrations; and Comparative ‘homology’ inference. OK — here’s some examples. Now then, why even bother — Applicability? Molecular evolution and phylogenetics We all know multiple sequence alignments are necessary for phylogenetic inference, but does everybody here truly realize that the absolute positional homology of every column in a data matrix passed on to these programs is the most critical assumption that all the algorithms make (but see Bayesian coestimation)!

Upload: hoangkiet

Post on 17-May-2018

218 views

Category:

Documents


4 download

TRANSCRIPT

Steve Thompson 2/11/09

1

Feb. 9, 2009

BSC4933(04)/ISC5224(01):

Introduction to BioinformaticsFlorida State University

School of Computational Scienceand Department of Biological Science

More data yields stronger analyses — if done carefully! Thepatterns of conservation become ever clearer by comparing theconserved portions of sequences amongst a larger and larger

dataset. Mosaic ideas and evolutionary ‘importance.’

Multiple SequenceAlignment — the basics

Steven M. ThompsonFlorida State University Department

of Scientific Computing

Molecular evolutionary analysis; plus

Probe/primer, and motif/profile design;

Graphical illustrations; and

Comparative ‘homology’ inference.

OK — here’s some examples.

Now then, why even bother— Applicability?

Molecular evolution andphylogeneticsWe all know multiple sequencealignments are necessary forphylogenetic inference, but doeseverybody here truly realize that theabsolute positional homology of everycolumn in a data matrix passed on tothese programs is the most criticalassumption that all the algorithmsmake (but see Bayesian coestimation)!

Steve Thompson 2/11/09

2

And what about this other stuff?

Multiple sequence alignments can beindispensable for primer design whenyou don’t have data on a particulartaxa, yet data is available in relatedtaxa. The conservation andvariability within an alignment canhelp guide the design of universal orspecies specific primers.

Here’s an HPV L1 example

The ellipses show areas where PCR primers could differentiate the Type 16 cladefrom it’s closest relatives — areas of high L1 conservation in the Type 16 clade (redline) that correspond to areas of much weaker conservation in the others (blue line).

Motif and profile definitionAn alignment of humanSRY/SOX proteinsillustrates theconservation of theHMG box. Conservedregions can bevisualized with a slidingwindow approach andappear as peaks.Motifs and (better yet)HMM profiles can becreated of the region tobe used as a searchtool to find other HMGbox proteins.

HMGbox

One picture’s worth . . .

The HMG-box domain is strikingly conserved amongst the otherwisenearly unalignable human DNA regulatory paralogous protein family.

Steve Thompson 2/11/09

3

Structure/function homology inferenceA Swiss-Modelhomology based modelof Giardia EF1αsuperimposed over itseight most similarsequences with solvedstructure. Amazinglyaccurate inferences ofboth function andstructure are possibleusing comparativemethods.

On to aligning multiple sequences —dynamic programming’s complexityincreases exponentially with the number ofsequences being compared:

N-dimensional matrix . . . .complexity O ( [sequence length]number of sequences )

See —

MSA (‘global’ within ‘bounding box’) and

PIMA (‘local’ portions only) on themultiple alignment page at the

Both available at the Baylor College ofMedicine’s Search Launcher —

http://searchlauncher.bcm.tmc.edu/ —but, severely limiting restrictions!

A couple ‘global’ solutions usingheuristic tricks

. . . restricts thesolution to the neighbor-hood of only twosequences at a time.

All sequences arecompared, pairwise,and then each isaligned to its mostsimilar partner or groupof partners representedas a consensus. Eachgroup of partners isthen aligned to finishthe complete multiplesequence alignment.

Therefore — pairwise, progressivedynamic programming . . .

Steve Thompson 2/11/09

4

Enhancements on the themeFirst enhancements came from ClustalW —

variable sequence weighting, dynamicallyvarying gap penalties and substitutionmatrices, and a neighbor-joining guide-tree.

Since the year 2000 a slew of new programshave tried other heuristic variations, all inattempts to build faster, more accuratemultiple sequence alignments. The devil’s inthe details: Muscle, ProbCons, T-Coffee,MAFFT and many, many more.

This was pretty much the original ClustalVand GCG’s PileUp program . . . then . . .

MuscleAn iterative method that uses weighted log-expectationprofile scoring along with a slew of optimizations. Itproceeds in three stages — draft progressive using k-mer counting, improved progressive using a revisedtree from the previous iteration, and refinement bysequential deletion of each tree edge with subsequentprofile realignment.

ProbConUses Hidden Markov Model (HMM) techniques andposterior probability matrices that compare randompairwise alignments to expected pairwise alignments.Probability consistency transformation is used toreestimate the scores, and a guide-tree is thenconstructed, which is used to compute the alignment,which is then iteratively refined. Incredibly accurate.

T-CoffeeUses a preprocessed, weighted library of all the pairwiseglobal alignments between your sequences, plus the tenbest local alignments associated with each pair. Thishelps build the NJ guide-tree and the progressivealignment. The library is used to assure consistency andhelp prevent errors, by allowing ‘forward-thinking’ to seewhether the overall alignment will be better one way oranother after particular segments are aligned one way oranother. The institutional schedule analogy . . . .T-Coffee can even tie together multiple methods asexternal modules, making consistency libraries from theresults of each, as long as all the specified methods areinstalled on your system. T-Coffee is one of the mostaccurate multiple sequence alignment methods availablebecause of this consistency based rationale, but it is notthe fastest. Regardless, I encourage you to check it out!

MAFFT— has many modes, among them: a couple ofprogressive, approximate modes, using a fast Fouriertransformation (FFT); a couple of iteratively refinedmethods that add in weighted-sum-of-pairs (WSP)scoring; and several iterative methods that use WSPscoring combined with a T-Coffee-like consistencybased scoring scheme. Speed and accuracy areinversely proportional for these from fast and rough,to slow and accurate, respectively.

MAFFT provides command aliases for all of these,from fast to slow — FFTNS with or without retree,FFTNSI with or without maxiterate, and the threecombined approaches EINSI, LINSI, and GINSI.

Steve Thompson 2/11/09

5

MAFFT’s basic algorithmMAFFT’s fast Fourier transform provide a huge speedup overprevious methods. Homologous regions are quickly identified byconverting amino acid residues to vectors of volume and polarity,thus changing a twenty-character alphabet to six, rather than byusing an amino acid similarity matrix. Similarly, nucleotide basesare converted to vectors of imaginary and complex numbers. TheFFT trick then reduces the complexity of the subsequentcomparison to O ( N logN ). FFT identifies potential similaritiesthough, without localizing them; a sliding window step using theBLOSUM62 matrix is used for this.

Then MAFFT constructs a distance matrix, and hence aprogressive guide tree, on the number of shared six-tuples fromthis Fourier transform, rather than on a ranking based on full-length, pairwise sequence similarity. The user can specify howmany times a new guide tree is subsequently recalculated from aprevious alignment as many times as desired; the alignment isreconstructed using the Needlman Wunsch algorithm each time.

Some of MAFFT’s many modesAnd each mode has a bunch of additional options!1) Most basic, fastest modes — just progressive.

a) FFTNS1 (fftns --retree 1)b) FFTNS2 (fftns) (same as mafft --retree 2)

Suitable for 1,000’s of easily aligned sequences.A rough distance matrix is built from the sequences

using FFT and the shared number of six-mers.A modified UPGMA guide tree is built from this matrix.The sequences are aligned according to the rough,

initial guide tree (as in ‘traditional’ methods).FFTNS2 adds a recomputation of the guide tree

(retree 2) from the original alignment, fromwhich a new progressive alignment is built.

MAFFT’s interative refinements2) Intermediate modes — progressive + iterations

to maximize the WSP objective function.a) FFTNSI (fftnsi) default two cycles, or e.g.

fftnsi --maxiterate 1000

b) NWNSI (nwnsi) same as FFTNSI, but noFFT, Needleman Wunsch only.

Progressive alignment and retree as before, with orwithout FFT, and then . . . .

Iterative refinement is cycled twice (default), orrepeatedly until there is no further improvement,or until you reach your specified limit number.

Suitable for 100’s through 1000’s of sequences.

MAFFT’s most accurate modes3) Advanced modes — progressive + iterations to

maximize the objective WSP and T-Coffee-likeconsistency functions. Options differ according tothe way the pairwise alignments are calculated.a) EINSI (einsi) most general of these.

Uses a Smith Waterman style local algorithmwith generalized affine gap costs for the pairwisestep. Most appropriate for sequences with multi-shared, similarly ordered domains, in an otherwisenearly unalignable ‘mess,’ .e.g:

ooooooXXX------XXXX-----------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXoooooooooo------XXXXXXXXXXXXXooo--------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX------------ooooXXXXXX---XXXXooooooooooo------------XXXXX----XXXXXXXXXXXXXXXXXXoooooooooo------XXXXX----XXXXoooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX----------------XXXXX----XXXX-----------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo-----

Steve Thompson 2/11/09

6

MAFFT’s most accurate modes, cont.3) Advanced modes — progressive + iterations to

maximize the objective WSP and T-Coffee-likeconsistency functions. Options differ according tothe way the pairwise alignments are calculated.b) LINSI (linsi) strictly local.

Uses a Smith Waterman style local algorithmwith affine gap costs for the pairwise step. Mostappropriate for sequences with only one single,shared domain, in an otherwise nearly unalignable‘mess,’ .e.g:

--------------XXXXXXXXXXX-XXXXXXXXXXXXXXXoooooooooo--------------XXXXXXXXXXXXXXXXXX-XXXXXXXX------------------------XXXXX----XXXXXXXXXXXXXXXXXXooooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX------------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo-----

MAFFT’s most accurate modes, cont.3) Advanced modes — progressive + iterations to

maximize the objective WSP and T-Coffee-likeconsistency functions. Options differ according tothe way the pairwise alignments are calculated.c) GINSI (ginsi) strictly global.

Uses a Needleman Wunsch style globalalgorithm with affine gap costs for the pairwise step.Most appropriate for sequences where only onesingle, shared domain extends the full length of allof the sequences, .e.g:

XXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXooooXXXooXXX-XXXXXXXXXXXXXXXXXX-XXXXXXXX--XXXXXXX---XXXXX--XXXXX---XXXXXXXXXXXXXXXXXXXoooooXXoooXXooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXXX-XXXXX---XXXXXXXXXX--XXXXXXXoooooXXXXXXXXX--

How to know when to use what

for all of them — Take home message:

For simple cases it doesn’t really matter whatprogram to use. For complicated situations it may,and what you use will depend on the size of yourdataset, personal preferences, time allotted, andhow much hand editing you want to do.

Really nice, recent review: Edgar, R.C. andBatzoglou, S. (2006) Multiple sequence alignment.Current Opinion in Structural Biology 16, 368–373.

The rest of my references can be found in mytutorial manuscript.

for MAFFT — see “tips,” 2,3, and 4 pages,

You can do a lot of this stuff on theWeb, if you need to — some resourcesfor multiple sequence alignment:http://www.techfak.uni-

bielefeld.de/bcd/Curric/MulAli/welcome.html.http://pbil.univ-lyon1.fr/alignment.htmlhttp://www.ebi.ac.uk/clustalw/http://searchlauncher.bcm.tmc.edu/

However, problems with very large datasetsand huge multiple alignments make doingmultiple sequence alignment on the Webimpractical after your dataset has reached acertain size. You’ll know it when you’re there!

Steve Thompson 2/11/09

7

Reliability and theComparative Approach —explicit homologous correspondence;manual adjustments should be

encouraged — based on knowledge,especially structural, regulatory, and

functional sites.Therefore, editors like SeaView anddatabases like the Ribosomal Database

Project:http://rdp.cme.msu.edu/index.jsp

Coding DNA issuesWork with proteins! If at all possible.Twenty match symbols versus four, plus

similarity versus identity!Way better signal to noise.

Also guarantees no indels are placed withincodons. So translate, then align.

Nucleotide sequences will only reliably align ifthey are very similar to each other. Andthey will likely require extensive andcarefully considered hand editing with aneditor like SeaView.

Beware of aligning apples andoranges [and grapefruit]!

receptors and/oractivators with theirnamesake proteins;parologous versusorthologous;genomic versuscDNA;mature versusprecursor.

Mask out uncertain areas —

Steve Thompson 2/11/09

8

Complications —Order dependence.

Not that big of a deal.Substitution matrices and gap penalties.

Can be a very big deal!

Regional ‘realignment’ becomesincredibly important, especially withsequences that have areas of high andlow similarity. SeaView let’s you do this!

Complications cont. —

Format hassles!Specialized format conversion tools

such as GCG’s SeqConv+ programand PAUPSearch, and

Don Gilbert’s public domain ReadSeqprogram.

Plus, some programs like SeaViewcan read and write several formats.

Still more complications —

Indels and missingdata symbols (i.e.gaps) designationdiscrepancyheadaches —

., -, ~, ?, N, or X. . . . . Help!

FOR MORE INFO...

Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html.Contact me ([email protected]) for specific long-distance

bioinformatics assistance and collaboration.

Gunnar von Heijne in his old but quite readable treatise, SequenceAnalysis in Molecular Biology; Treasure Trove or Trivial Pursuit(1987), provides a very appropriate conclusion:

“Think about what you’re doing; use your knowledge of the molecularsystem involved to guide both your interpretation of results and yourdirection of inquiry; use as much information as possible; and do notblindly accept everything the computer offers you.”

He continues:

“. . . if any lesson is to be drawn . . . it surely is that to be able to make auseful contribution one must first and foremost be a biologist, and onlysecond a theoretician . . . . We have to develop better algorithms, wehave to find ways to cope with the massive amounts of data, and aboveall we have to become better biologists. But that’s all it takes.”

Conclusions —