Sequence alignments and sequence database searching
Igor Kuznetsov
Bioinformatics workshop Part I
Sponsored by Kansas NSF EPSCoR and K-BRIN
E-mail: [email protected]
Office: 1002 Haworth
Pair-wise sequence alignment
What is pair-wise sequence alignment?
• Given two sequences of characters and a scoring scheme for comparing individual characters, it is a one-to-one mapping of equivalent characters between the two sequences that preserves the order of characters in both sequences:
Sequence 1: THISISASTRING
Sequence 2: THISISANOTHERLONGERSTRING

THISISA------------STRING
|||||||            ||||||
THISISANOTHERLONGERSTRING
Biological sequences consist of characters that encode biological function.
Proteins: 20 characters
(A,G,E,C,W,S,D,F,H,I,K,L,P,Q,M,N,V,Y,R,T)
CAP protein:
Unknown: NLRHRETSTLSGVTRETVTRTLGKLEKKGL
Known:   GTPHRELSSISGLARETVTRCLTRHFALCF
(Figure labels along the sequence: helix 1, turn, helix 2)
Here similar amino acid sequences perform the same function and have the same structure: helix-turn-helix motif.
Alignment is THE tool of bioinformatics. It is used to quantify and to visualize sequence similarity.
It is assumed that:
• “Good” alignment = Similar sequences
• Similar sequences are likely to have similar function and/or structure.
• Can use knowledge-based approach – if one sequence has known structure/function, sequence alignment can be used to “map” this knowledge onto other, similar sequences.
How to tell that alignment is “good”?
• The total number of possible alignments is huge.
• We need an objective function (scoring system) to identify the best alignment (called an optimal alignment).
Three possible variants for each position in a sequence alignment:
1. Exact match – a pair of identical characters are aligned.
2. Inexact match (substitution) – two different characters are aligned.
3. Gap (insertion/deletion) – a character is missing in one sequence.
ACT-ACGGATAG
ATTTAGGG--AG
Total: 2 substitutions, 7 exact matches, 2 gaps (one gap of length 1, one gap of length 2)
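The three column types can be counted mechanically. The following is a minimal sketch, assuming the two rows of the example read ACT-ACGGATAG and ATTTAGGG--AG and that "-" marks a gap; the function name `classify_columns` is a hypothetical helper, not part of any alignment package.

```python
from itertools import groupby

def classify_columns(row1, row2, gap="-"):
    """Count exact matches, substitutions and gaps in a gapped pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    labels = []
    for a, b in zip(row1, row2):
        if a == gap:
            labels.append("gap_in_row1")
        elif b == gap:
            labels.append("gap_in_row2")
        elif a == b:
            labels.append("match")
        else:
            labels.append("substitution")
    # Consecutive gap columns in the same row form one gap; record each run length.
    gap_lengths = [len(list(run)) for label, run in groupby(labels)
                   if label.startswith("gap")]
    return {
        "matches": labels.count("match"),
        "substitutions": labels.count("substitution"),
        "gap_lengths": gap_lengths,
    }

summary = classify_columns("ACT-ACGGATAG", "ATTTAGGG--AG")
print(summary)
```

For these two rows the counts agree with the totals stated on the slide.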
A general scoring system for sequence alignment consists of two parts:
1. A score for aligning a pair of characters (a,b): S(a,b)=S(b,a)
2. A penalty for each gap as a function of its length, k: g=f(k)
Then the total pair-wise alignment score between two sequences X and Y is:
$$\mathrm{Score}(X,Y) = \sum_{\text{all aligned pairs } (a,b)} S(a,b) + \sum_{\text{all gaps}} g$$
X: ATGCTGGGA
Y: ATCCT--GA
Each aligned pair contributes S(a,b); the gap of length 2 in Y contributes g.
For the highlighted position (the third column): a=G, b=C and S(G,C)=S(C,G)
Amino acid scoring matrix, s(a,b)
• A symmetric 20x20 matrix (we have 210 possible residue pairs)
• If residues a and b are similar, s(a,b) > 0
• If residues a and b are dissimilar, s(a,b) < 0
Gap penalty function
• It is gaps that make alignment computationally complex.
• Structural meaning of gaps in proteins – a non-essential part of the sequence is lost or a new part is added. If the alignment is wrong, an incorrect part will be assumed to be inserted/deleted.
(Figure: protein 1 and protein 2, with a correct and an incorrect alignment of Seq 1 and Seq 2.)
An example: scoring an alignment of two DNA sequences using a similarity matrix
(Figure: sequence 1, sequence 2, and the matrix for s(a,b).)
We will use an affine gap penalty, linear in the gap length k: g(k) = α + β*(k-1), where α is the penalty for opening a gap and β is the penalty for each additional position in the gap.
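The total score can be computed column by column. The sketch below is illustrative only: the slide's similarity matrix is not reproduced in this transcript, so a uniform `match`/`mismatch` score stands in for s(a,b), and the values chosen for the gap opening penalty `alpha` and gap extension penalty `beta` are assumptions.

```python
def affine_gap_penalty(k, alpha=-5, beta=-1):
    """g(k) = alpha + beta*(k-1): alpha opens the gap, beta extends it."""
    return alpha + beta * (k - 1)

def alignment_score(row1, row2, match=2, mismatch=-1, alpha=-5, beta=-1, gap="-"):
    """Sum pair scores over aligned columns and add one penalty per gap."""
    score = 0
    run_len = 0      # length of the gap currently being read
    run_row = None   # which row (1 or 2) holds the current gap
    for a, b in zip(row1, row2):
        row = 1 if a == gap else 2 if b == gap else None
        if row is not None and row == run_row:
            run_len += 1          # the current gap continues
            continue
        if run_len:               # a gap just ended: charge it once
            score += affine_gap_penalty(run_len, alpha, beta)
        run_len, run_row = (1, row) if row is not None else (0, None)
        if row is None:
            score += match if a == b else mismatch
    if run_len:                   # gap running to the end of the alignment
        score += affine_gap_penalty(run_len, alpha, beta)
    return score

print(alignment_score("ATGCTGGGA", "ATCCT--GA"))
```

With these assumed values, the seven aligned pairs contribute 6*2 - 1 = 11 and the single gap of length 2 contributes -6, so the example alignment scores 5.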
The meaning of the alignment score
When we perform a summation over all positions of the alignment of sequences X and Y, the total score S(X,Y) gives us an estimate of the likelihood that the alignment represents similarity compared to aligning X and Y at random:
S(X,Y) > 0 – more likely than random (many similar amino acids are aligned)
S(X,Y) < 0 – less likely than random (many dissimilar amino acids are aligned)
S(X,Y) = 0 – random alignment
Dot Plot – the simplest way to visualize pair-wise alignment
Put a dot in the cell (i,j) if the character in row i is the same as the character in column j, A(i) = A(j).
(Figure: dot plots of the palindrome ATTA and of the inverted repeat AT - TA.)
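A dot plot is simple enough to compute directly. This is a minimal text-mode sketch; `dot_plot` is a hypothetical helper, and real tools usually add a sliding window and a score threshold to suppress noise.

```python
def dot_plot(seq_rows, seq_cols):
    """Return the dot matrix as strings: '*' where the two characters are identical."""
    return ["".join("*" if a == b else "." for b in seq_cols) for a in seq_rows]

# Self-comparison of the palindrome from the slide.
for row_char, line in zip("ATTA", dot_plot("ATTA", "ATTA")):
    print(row_char, line)
```

In a self-comparison the main diagonal is always filled; the palindrome ATTA additionally fills the anti-diagonal.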
On-line Java-based tool for computing Dot Plots
• http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
Optimal sequence alignments
We can compute two types of optimal alignments:
• GLOBAL – aligns entire sequence 1 against entire sequence 2
• LOCAL – aligns a region of sequence 1 against a region of sequence 2
Very often local alignment is a better choice, since it finds only similar subsequences and does not attempt to align totally unrelated segments.
Local alignment
Global alignment
Meaningless alignment
Optimal local alignment
CCCCCCCCCCCCATATATATACCCCCCCCCCCC
vs.
GGGGATATTTATAGGGGGGGGGGG
ATATATATA
ATATTTATA
We want only this part
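The local alignment above can be recovered with the Smith-Waterman dynamic programming recursion. The sketch below is a simplified illustration: it uses a linear gap penalty rather than an affine one, and the scores (`match=2`, `mismatch=-1`, `gap=-2`) are assumed values, not taken from the slides.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Best local alignment of s and t with a linear gap penalty (illustrative scores)."""
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # The 0 lets a new local alignment start anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

score, top, bottom = smith_waterman("CCCCCCCCCCCCATATATATACCCCCCCCCCCC",
                                    "GGGGATATTTATAGGGGGGGGGGG")
print(score)
print(top)
print(bottom)
```

Under these assumed scores the flanking C and G runs are ignored, and only the AT-rich region is reported: eight matches and one substitution.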
Sub-optimal local alignments
• In order to find all biologically meaningful similarities between two sequences we need to find all high scoring local alignments, not just the optimal one.
• Using a technique called de-clumping, local alignment programs can compute any given number of sub-optimal alignments (2nd best, 3rd best, etc.).
(Figure: Sequence 1 vs. Sequence 2, with the optimal (best), 2nd best and 3rd best local alignments marked.)
Global alignment with zero end gap penalties
• Used to align a short sequence with a long sequence:
GCGCGCCGCGCGATATATATAGCGCGCCCGCGC
vs.
ATATTTTATATA
GCGCGCCGCGCGATAT---ATATAGCGCGCCCGCGC
------------ATATTTTATATA------------
The end gaps in the second row are not penalized.
The internal gap in the first row is penalized.
Basic notation
• Percent identity = #identical pairs/alignment length
• Percent similarity = #similar pairs/alignment length (similar pairs are those that receive a positive score)
• Alignment score = sum of scores over aligned positions + sum of gap penalties
Example: Alignment length = 36, Alignment score = 19, Percent identity = 16.7%, Percent similarity = 52.8%
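These definitions translate directly into code. The following is a minimal sketch; the function names and the toy scoring function are hypothetical, and "-" is assumed to mark a gap.

```python
def percent_identity(row1, row2):
    """Identical aligned pairs divided by alignment length, in percent."""
    identical = sum(1 for a, b in zip(row1, row2) if a == b and a != "-")
    return 100.0 * identical / len(row1)

def percent_similarity(row1, row2, score):
    """score(a, b) is any similarity function; pairs scoring > 0 count as similar."""
    similar = sum(1 for a, b in zip(row1, row2)
                  if "-" not in (a, b) and score(a, b) > 0)
    return 100.0 * similar / len(row1)

# Toy scoring function (an assumption): identical characters score +1, others -1.
toy_score = lambda a, b: 1 if a == b else -1

print(round(percent_identity("ATGCTGGGA", "ATCCT--GA"), 1))
print(round(percent_similarity("ATGCTGGGA", "ATCCT--GA", toy_score), 1))
```

The percentages quoted on the slide are consistent with 6 identical pairs and 19 similar pairs in an alignment of length 36 (6/36 = 16.7%, 19/36 = 52.8%).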
Example: global vs. local
Optimal global alignment obtained using BLOSUM50, A=-12, B=-2
Optimal local alignment obtained using BLOSUM50, A=-12, B=-2
Global vs. Local:
• Use global alignment if
– You expect, based on some biological information, that your sequences will match over the entire length.
– Your sequences are of similar length.
• Use local alignment if
– You expect that only certain parts of two sequences will match (as in the case of a conserved segment that can be found in many different proteins).
– Your sequences are very different in length.
– You want to search a sequence database (we will talk about it in detail later).
– NOTE: local alignment works only with similarity matrices because the total score must be able to take +, - and 0 values.
Selecting the right protein scoring matrix
• PAM matrices:
– PAM250 is suitable for highly diverged sequences (>30% identity) and is the best choice for aligning unknown proteins.
– PAM160 is suitable for sequences which are 50-60% identical.
– PAM40 is suitable for very similar proteins (70-90% identity)
• BLOSUM matrices:
– BLOSUM50-62 is suitable for highly diverged sequences (>30% identity) and is the best choice for aligning unknown proteins.
– BLOSUM80 is suitable for 50% identity level.
– BLOSUM90 is suitable for very similar proteins.
• Approximate equivalents: PAM250 ≈ BLOSUM45, PAM160 ≈ BLOSUM62, PAM120 ≈ BLOSUM80
– Note that an increase in numbering in PAM corresponds to a decrease in BLOSUM numbering and vice versa.
Lowering the gap penalties results in a different optimal alignment with higher % identity.
Low gap penalties -> artificially high percent of identity between sequences
Low complexity regions
• Real biological sequences have many regions where one or a few characters are over-represented (so-called low complexity regions):
ATGGPTIVLLVAAAAAAAAAAGPTPGLILW
           |||| |||||
EVVIKPSMCDHAAAATAAAAALCMKFC
• Such regions will bias the alignment because they tend to align with each other. Even absolutely unrelated sequences will have regions of false “similarity”.
• Mask these regions using PSEG for proteins, NSEG or DUST for DNA.
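To make the idea of masking concrete, here is a deliberately simplified sketch that masks windows whose Shannon entropy falls below a threshold. It is not the algorithm used by PSEG, NSEG or DUST; the window size, the threshold and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy (bits) of the character composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=10, threshold=1.0, mask_char="X"):
    """Replace every window with entropy below the threshold by mask characters."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = mask_char * window
    return "".join(masked)

masked = mask_low_complexity("ATGGPTIVLLVAAAAAAAAAAGPTPGLILW")
print(masked)
```

A window-based mask like this also swallows a few residues flanking the poly-A run, which is one reason dedicated programs use more careful boundary refinement.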
Masking low complexity regions in proteins (PSEG program)
General suggestions for doing pairwise alignment
1. Always translate protein-coding DNA sequences – protein sequences contain more information.
2. Make a pairwise dotplot and a dotplot of each sequence with itself.
• Look for low complexity regions and repeats.
• Mask low complexity regions, remove repeats.
3. Do a local alignment first to see whether your sequences are similar over their entire lengths.
• If they are or if you are absolutely positive that despite the lack of apparent sequence similarity your sequences can be aligned globally, use global alignment.
• ALWAYS perform a visual inspection.
4. Use the default gap penalties supplied by the program first. The choice of gap penalties depends on the matrix.
5. Some matrices are good for distantly related sequences, some are good for closely related sequences.
6. Do alignments with different matrices and gap penalties. Regions that produce consistent alignments usually can be trusted.
7. Usually, the number of optimal alignments > 1.
8. The optimal alignment found by alignment program is not necessarily the correct one from the biological point of view.
• One of the nearly-optimal alignments may be what you need. You may need to adjust the optimal alignment manually according to your biological “hunch”.
Pairwise alignment programs
1. Stand-alone programs from FASTA package http://fasta.bioch.virginia.edu/
• ALIGN – global alignment (affine gap penalties).
• ALIGN0 – global alignment with 0 end gap penalties (affine gap penalties).
• LALIGN – n best scoring local alignments (affine gap penalties).
2. On-line alignment programs available from KU Bioinformatics: http://jay.bioinformatics.ku.edu/EMBOSS/index.html
• NEEDLE (global alignment with affine gap penalties)
• WATER (local alignment with affine gap penalties)
• DotPlot
3. All possible types of alignments on one web-server at USC:
• http://www-hto.usc.edu/software/seqaln/seqaln-query.html
Application of dynamic programming (DP) to search sequence databases.
Objective of sequence database searches:
• Given a DNA or protein sequence, find all sequences in the database that are similar to this sequence.
• Main problem is how to pick all similar sequences and filter out all dissimilar sequences. Final result always depends on the definition of similarity.
Some terminology used in sequence searching
• Query sequence – the sequence of interest which is compared to the database sequences.
• Significance threshold – the critical value of the variable (alignment score, probability, etc) used to assess the alignment between query sequence and a database sequence and to draw conclusions about similarity between them.
• Hit – an alignment between query sequence and database sequence that scores above the significance threshold.
SEARCHING SEQUENCE DATABASES
• OBJECTIVE: given a query sequence, find all database sequences that are significantly similar to the query sequence.
• IMPLEMENTATION:
– Align query sequence to each database sequence using local alignment (N best alignments in total).
– Keep only those alignments that score higher than a certain significance threshold.
E-value
• The significance of observed similarity between query and database sequence is assessed using the E-value.
• E-value is a normalized statistic that gives the number of matches with a score equal to or greater than the observed score, x, that are expected to occur by chance when searching a database of given size N.
• The lower the E-value, the higher the chance that similarity between query and database sequence is significant.
• However, similarities we deem “significant” depend on an arbitrary choice of E-value.
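The slides do not give a formula for the E-value. One common form, the Karlin-Altschul estimate used for BLAST-style local alignment scores, is E = K*m*n*exp(-λ*S), where m is the query length, n the database length and S the raw score. The sketch below assumes that form; the values of `K` and `lam` are illustrative placeholders, since the real constants depend on the scoring matrix and gap penalties.

```python
import math

def expected_hits(score, query_len, db_len, K=0.13, lam=0.32):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S) (illustrative constants)."""
    return K * query_len * db_len * math.exp(-lam * score)

# A higher score gives a lower E-value; a larger database gives a higher one.
for s in (40, 60, 80):
    print(s, expected_hits(s, query_len=300, db_len=10**8))
```

This makes the two dependencies stated above explicit: E falls exponentially with the score and grows linearly with the size of the database searched.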
The twilight zone
• When percent identity between two aligned sequences drops below 25-30%, the estimates of statistical significance fail to distinguish between related and unrelated sequences. In other words, true similarity cannot be distinguished from random match.
• This sequence identity threshold is usually referred to as the twilight zone of pairwise sequence alignment. This is the green area of the overlap we saw previously.
• Proteins with similar structure/function that have pairwise sequence identity below 25-30% can score lower than structurally and functionally dissimilar proteins.
Mathematically Rigorous Search: Smith-Waterman Method
1. SSEARCH – a stand-alone program from FASTA package http://fasta.bioch.virginia.edu/
2. On-line alignment available from EMBL-EBI: http://www.ebi.ac.uk/MPsrch/
• SW search with column-cost and affine gap penalties
• SW is the most rigorous method to do a database search using a single query sequence. It is also the slowest one.
• Heuristic search methods are much faster: BLAST – Basic Local Alignment Search Tool.
The main idea of BLAST
Nucleotide BLAST looks for one exact match of length W (W=11 by default):
ATCGCCATGCTTAATTGGGCTT
     CATGCTTAATT   exact match
Protein BLAST requires two neighboring hits of length W (W=3 by default).
(Figure: query sequence and database sequence with hit 1 and hit 2, each of length W=3.)
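The seeding step for nucleotide sequences can be sketched as an exact word lookup. This is a simplified illustration, not BLAST's implementation: it only finds exact shared words of length W, whereas protein BLAST also accepts similar ("neighborhood") words and BLAST then extends each seed into a longer alignment. The function name `word_hits` is hypothetical.

```python
from collections import defaultdict

def word_hits(query, subject, W=11):
    """Return (query_pos, subject_pos) for every exact shared word of length W."""
    index = defaultdict(list)
    for i in range(len(query) - W + 1):
        index[query[i:i + W]].append(i)       # index all words of the query
    hits = []
    for j in range(len(subject) - W + 1):
        for i in index.get(subject[j:j + W], []):
            hits.append((i, j))               # same word found in the subject
    return hits

print(word_hits("CATGCTTAATT", "ATCGCCATGCTTAATTGGGCTT"))
```

If two sequences share no exact word of length W, this step returns nothing and no alignment is attempted, which is why alignments without such a seed are missed at default settings.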
An alignment that BLAST can’t find using default parameters because there is no hit of length 11
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Significance of BLAST hits
• Statistically significant does not always imply “biologically meaningful”.
• There is no strict rule for choosing an E-value threshold. There is always a trade-off between sensitivity and specificity.
• E-value < 10^-6 – almost surely homologous. Will miss remote homologues.
• Between 10^-2 and 10^-6 – probably homologous.
• Between 10^-2 and 10 – might or might not be interesting…
• In any case, use your own judgment.
• Pairwise sequence comparison cannot detect remote homologues reliably when sequence identity drops below 25-30%. Need more sophisticated approaches.
Protein databases for BLAST
DNA databases for BLAST
Important points to keep in mind when doing a sequence search
• Always translate cDNA sequences – protein sequences are much more informative (4 characters in DNA vs. 20 characters in proteins). Protein scoring matrices are also more informative.
• Filter low complexity regions (both DNA and proteins)
• Mask interspersed repeats in DNA using RepeatMasker.
• Proteins: start with BLOSUM62, repeat search with BLOSUM45
• DO NOT change default gap penalties !!!
(Flowchart of the search strategy:)
1. query sequence → BLAST search with BLOSUM62, BLOSUM45
2. No similarity → Smith-Waterman search with BLOSUM62, BLOSUM45
3. No similarity → Search against profiles and HMMs
4. No similarity = trouble
Multiple Sequence Alignment
What is Multiple Sequence Alignment (MSA)?
• Alignment of three or more sequences.
• Each column contains characters that are equivalent across all sequences.
• Put a gap if a sequence has no equivalent character in a given column:
Two problems with MSA
1. There is no good way to score multiple sequence alignments.
2. Optimal answer cannot be found for more than 6-7 sequences because of extremely high computational complexity.
– MSA is performed using heuristic methods that DO NOT guarantee an optimal solution.
– Progressive MSA is the most widely used method.
(Flowchart of progressive MSA:)
1. Alignment of each sequence pair (i,j)
2. Distance for each sequence pair, d(i,j)
3. Guide tree
4. Alignment using the guide tree
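The first two stages of progressive MSA can be sketched as follows. This is a toy illustration under stated assumptions: each pair is aligned globally with a linear gap penalty, the distance is taken as d = 1 - fraction of identical columns, and only the first join of the guide tree (the closest pair) is reported. Programs such as CLUSTAL use more refined distances and a full tree-building method; all function names here are hypothetical.

```python
from itertools import combinations

def global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch with a linear gap penalty; returns the two aligned rows."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] + gap, F[i][j - 1] + gap)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:   # trace back from the bottom-right corner
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))

def distance(s, t):
    """d = 1 - fraction of identical columns in the pairwise global alignment."""
    a, b = global_align(s, t)
    return 1.0 - sum(x == y for x, y in zip(a, b)) / len(a)

def closest_pair(seqs):
    """The pair with the smallest distance is the first to be joined in the guide tree."""
    return min(combinations(range(len(seqs)), 2),
               key=lambda p: distance(seqs[p[0]], seqs[p[1]]))

seqs = ["ATGCTGGGA", "ATCCTGA", "TTTTTTTTT"]
print(closest_pair(seqs))
```

The remaining joins would be chosen the same way after merging the closest pair, which yields the order in which sequences are added to the alignment.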
Once a gap, always a gap
• A tricky part of progressive MSA is that we have to align two alignments. Scoring gaps is the biggest problem.
• To simplify the situation, gaps introduced at earlier stages of MSA cannot be changed as new sequences are added to the alignment.
Problems with progressive MSA
• A heuristic solution, no guarantee of optimal alignment.
• The initial guide tree is not reliable and does not always give the correct relationship between aligned sequences.
• Alignment errors made at initial stages of MSA cannot be corrected and accumulate as more sequences are added to the alignment.
• Progressive alignment becomes very unreliable for very diverged sequences (below 30% identity).
• Only global alignment is used (attempts to align all parts of the sequences). Does not work well if sequences have very different lengths.
Programs for progressive MSA
• PILEUP – commercial, from GCG package.
• CLUSTAL – freeware, can be used both on-line (http://www.ebi.ac.uk/clustalw/) and installed locally on Windows OS.
• CLUSTAL has certain advantages over PILEUP:
– Weighting of sequences (removes effect of over-represented similar sequences)
– Empirical position-specific gap penalties (lower weights in stretches of amino acids that may be variable loop regions)
– Can align two user-supplied multiple alignments or align one sequence with an existing MSA.
MSA should ALWAYS be visually inspected and, in most cases, edited manually
MSA editors:
• GeneDoc (WINDOWS, http://www.psc.edu/biomed/genedoc/)
– Can be used to color MSA according to the secondary structure of one sequence (if structure is known) or various a.a. properties
– Provides reports on a.a. composition, pairwise scores
• BioEdit (WINDOWS, http://www.mbio.ncsu.edu/BioEdit/bioedit.html)
– Has a lot of features. Can be integrated with accessory applications and run CLUSTAL, BLAST, tree construction programs
• CINEMA (Java applet, run from web-browser, no installation http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/)
• MACAW (WINDOWS, Mac, current NCBI link is broken)
BioEdit
• Has a variety of options for both DNA and protein sequences: Search, Translate, Reverse, DotPlot, Edit and Shade MSA, etc.
• Works with a variety of formats including FASTA and Clustal
• BEST feature: capable of handling huge datasets
GeneDoc
• Edit and Shade MSA, etc.
• Works with a variety of formats including FASTA and Clustal.
• BEST feature: can shade protein sequences using a secondary structure mask from DSSP.
MACAW
• Old program. BEST feature: can construct MSA using conserved blocks found with LOCAL pair-wise sequence alignment and Gibbs sampler.
Incorporation of local alignment features into MSA
• T-Coffee is an “add-on” program that interacts with CLUSTAL and uses both local and global alignment information to improve the results of MSA.
• Has been shown to improve the accuracy of CLUSTAL in some cases when sequences with low identity are being aligned.
Available on-line at: http://www.ch.embnet.org/software/TCoffee.html
Very time-consuming. Can be installed locally (requires CLUSTAL). Local version is available at KU Bioinformatics server.
T-Coffee
Main features:
• Combines alignment information from heterogeneous sources, including pair-wise local alignments, into a library.
• Uses the library to perform a progressive alignment.
Alignment of cDNA in codon-to-codon mode
Synonymous substitution
http://jay.bioinformatics.ku.edu/~codon/
The easiest way to use multiple sequences to search for remote homologues: Position Specific Iterative BLAST (PSI-BLAST)
• Uses MSA in the form of Position Specific Scoring Matrix (PSSM) to detect remote homologues. Works for proteins only.
• Unlike standard BLAST, database sequences are aligned with the PSSM, not with the initial query sequence.
• PSI-BLAST is CONSIDERABLY more sensitive in finding divergent members of the same protein family than regular BLAST.
• The easiest way to use MSA for sequence search in completely automated mode.
Position Specific Scoring Matrix (PSSM)
(Figure: a PSSM derived from a Multiple Sequence Alignment.)
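How a PSSM is derived from an MSA can be sketched in a few lines. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes an ungapped alignment, uniform background frequencies and a flat pseudocount, whereas PSI-BLAST uses sequence weighting and matrix-derived pseudocounts. The three aligned segments are made up for the example.

```python
import math
from collections import Counter

def build_pssm(msa, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Log-odds PSSM (bits) from an ungapped MSA, assuming uniform background."""
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*msa):          # one dictionary of scores per alignment column
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            ch: math.log2((counts[ch] + pseudocount) / total / background)
            for ch in alphabet
        })
    return pssm

def score_segment(pssm, segment):
    """Sum the position-specific scores of a segment as long as the PSSM."""
    return sum(col[ch] for col, ch in zip(pssm, segment))

# Hypothetical ungapped alignment of three related segments.
pssm = build_pssm(["HRETS", "HRELS", "HKETS"])
print(score_segment(pssm, "HRETS"), score_segment(pssm, "WWWWW"))
```

Unlike a standard matrix such as BLOSUM62, the score of a residue here depends on the column it is aligned to, which is what makes profile searches more sensitive to family-specific conservation.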
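PSI-BLAST steps
1. Query sequence: standard BLASTP search of the protein database with a standard amino acid scoring matrix (e.g., BLOSUM62).
2. Filter sequences by inclusion E-value (or filter sequences manually).
3. Multiple sequence alignment of the retained hits.
4. PSSM built from the multiple sequence alignment.
5. BLAST search of the protein database using the PSSM.
6. Output (sequence hits to PSSM); steps 2-5 are repeated with the new hits.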
Example of how PSI-BLAST can be used to detect remote sequence similarities: 1AUX (synapsin) and 1GLV have only 17.5% sequence identity. Pairwise search methods fail.
Use sequence of 1AUX to run PSI-BLAST search
Sequences with E-value below this threshold will be used to construct PSSM
1st iteration is a regular BLAST search
PSI-BLAST iteration 2: 2.5 times more hits
PSI-BLAST iteration 3: 4 times more hits
PSI-BLAST iteration 4: 15 times more hits
New hits with known structure. Appeared only on 4th iteration. One of them is 1GLV
(Figure: structures of 1GLV and 1AUX.)
Synapsin 1AUX and Glutathione synthetase 1GLV have very similar structures
A word of caution on using PSI-BLAST
• Avoid including very close sequences in order not to over-fit.
• Unrelated sequences may be included. Always check sequences on each iteration using biological knowledge.
• Results depend on the choice of initial query sequence. Some query sequences will produce considerably better results.
• Powerful, but results may be misleading if too many irrelevant sequences were included.
• E-value should be treated very carefully.
De-novo motif detection
• MOTIF – a conserved region shared by many related sequences. These sequences can be functionally related, structurally related, or both.
– Example: TF-binding sites in upstream regions of co-regulated genes.
Procedure:
• Take a set of sequences that are presumed to contain at least one unknown common motif.
• Try to find this motif by comparing all similar segments found in these sequences and using statistics to filter out random similarities.
Searching for ungapped LOCAL MSAs
AAAAAAAAAAAGGGCGGAAAAA
TTTTTTTTTTTTTTTTTTTTTTGGGGGGTTTT
CCGTGGGGCCCAAAAA
AAAAACTGGGGGGCTCTCTCTCTCTCTCTCTC
AAAAAAAAAAAGGGCGGAAAAA
TTTTTTTTTTTTTTTTTTTTTTGTGGGGTTTT
CCGGGGGGCCCAAAAA
AAAAACTGGGGGGCTCTCTCTCTCTCTCTCTC
AAAAAAAAAAAGGGCGGAAAAA
CCGTGGGGCCCAAAAA
AAAAACTGGGGGGCTCTCTCTCTCTCTCTCTC
+
1st motif
2nd motif
MEME and MAST
• MEME (Multiple EM for Motif Elicitation) – finds one or more ungapped motifs in a set of DNA or protein sequences. Analyses all possible motif sizes and locations. http://meme.sdsc.edu/meme/website/intro.html
Also available at KU Bioinformatics server.
• MAST (Motif Alignment and Search Tool) – uses motifs detected by MEME to search sequence databases (the same web site).
• Advantages: the most user-friendly program, supported, runs on multiple processors.
• Disadvantages: not the most sensitive program, misses weak motifs. Time-consuming.
MEME
• On-line version does not allow all arguments to be changed.
• Main program arguments:
– OOPS (One Occurrence Per Sequence) – each sequence contains exactly one occurrence of a particular motif. Most sensitive to weak motifs. However, if some sequences do not contain the motif, it will be “blurry”.
– ZOOPS (Zero or One Occurrence Per Sequence) – each sequence contains one or zero occurrences of a particular motif. May miss weak motifs, but will not include irrelevant sites.
– TCM – any number of motifs per sequence. Will find repeated motifs within each sequence.
– Number of motifs to find
– E-value threshold – the program finds motifs with the lowest E-value first and stops when the E-value goes above this threshold.