Sequence alignments and sequence database searching

Igor Kuznetsov

Bioinformatics workshop Part I

Sponsored by Kansas NSF EPSCoR and K-BRIN

E-mail: [email protected]

Office: 1002 Haworth

Pair-wise sequence alignment

What is pair-wise sequence alignment?

• Given two sequences of characters and a scoring scheme for comparing individual characters, it is a one-to-one mapping of equivalent characters between the two sequences that preserves the order of characters in both sequences:

Sequence 1: THISISASTRING
Sequence 2: THISISANOTHERLONGERSTRING

THISISA------------STRING
|||||||            ||||||
THISISANOTHERLONGERSTRING

Biological sequences consist of characters that encode biological function.

Proteins: 20 characters (A,G,E,C,W,S,D,F,H,I,K,L,P,Q,M,N,V,Y,R,T)

CAP protein:

Unknown: NLRHRETSTLSGVTRETVTRTLGKLEKKGL
Known:   GTPHRELSSISGLARETVTRCLTRHFALCF

[Figure labels: helix 1, turn, helix 2]

Here similar amino acid sequences perform the same function and have the same structure: helix-turn-helix motif.

Alignment is THE tool of bioinformatics. It is used to quantify and to visualize sequence similarity.

It is assumed that:

• “Good” alignment = similar sequences.

• Similar sequences are likely to have similar function and/or structure.

• Can use a knowledge-based approach – if one sequence has known structure/function, sequence alignment can be used to “map” this knowledge onto other, similar sequences.

How to tell that an alignment is “good”?

• The total number of possible alignments is huge.

• We need an objective function (scoring system) to identify the best alignment (called an optimal alignment).

Three possible variants for each position in a sequence alignment:

1. Exact match – a pair of identical characters are aligned.

2. Inexact match (substitution) – two different characters are aligned.

3. Gap (insertion/deletion) – a character is missing in one sequence.

ACT-ACGGATAG
ATTTAGGG--AG

Total: 2 substitutions, 7 exact matches, 2 gaps (one gap of length 1 and one gap of length 2).
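The three column types can be counted mechanically. The sketch below is only an illustration, not part of the workshop material; the function name `classify_columns` is made up for this example, and a run of consecutive gap characters in the same row is counted as one gap, as on the slide.

```python
def classify_columns(row1, row2, gap="-"):
    """Count exact matches, substitutions and gaps in two aligned rows.

    A run of consecutive gap characters in the same row counts as ONE gap.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = substitutions = gaps = 0
    gap_side = None  # row (1 or 2) holding the gap run we are currently inside
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            side = 1 if a == gap else 2
            if side != gap_side:  # a new gap run starts here
                gaps += 1
            gap_side = side
        else:
            gap_side = None
            if a == b:
                matches += 1
            else:
                substitutions += 1
    return matches, substitutions, gaps


# The 12-column example: 7 exact matches, 2 substitutions, 2 gaps.
print(classify_columns("ACT-ACGGATAG", "ATTTAGGG--AG"))  # (7, 2, 2)
```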

A general scoring system for sequence alignment consists of two parts:

1. A score for aligning a pair of characters (a,b): S(a,b)=S(b,a)

2. A penalty for each gap as a function of its length, k: g=f(k)

Then the total pair-wise alignment score between two sequences X and Y is:

$$ Score(X,Y) = \sum_{\text{all aligned pairs } (a,b)} S(a,b) + \sum_{\text{all gaps}} g $$

X: ATGCTGGGA
Y: ATCCT--GA

[Figure labels: S(a,b) marks an aligned pair of characters; g marks the gap.]

For the highlighted position (the third column): a=G, b=C and S(G,C)=S(C,G)

Amino acid scoring matrix, s(a,b)

• A symmetric 20x20 matrix (we have 210 possible residue pairs).

• If residues a and b are similar, s(a,b) > 0.

• If residues a and b are dissimilar, s(a,b) < 0.

Gap penalty function

• It is gaps that make alignment computationally complex.

• Structural meaning of gaps in proteins – a non-essential part of the sequence is lost or a new part is added. If the alignment is wrong, an incorrect part will be assumed to be inserted/deleted.

[Figure: protein 1 and protein 2, with a correct and an incorrect alignment of Seq 1 and Seq 2]

An example: scoring an alignment of two DNA sequences using a similarity matrix

[Figure: sequence 1, sequence 2, and the matrix for s(a,b)]

We will use an affine gap penalty: g(k) = α + β·(k-1), where k is the gap length, α is the penalty for opening a gap and β is the penalty for each additional gap position.
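As a rough sketch of how the two parts of the scoring system combine, the code below sums S(a,b) over aligned pairs and adds an affine penalty for each gap run. The numeric values (+1 for a match, -1 for a substitution, -5 for opening a gap, -1 for each extension) and the names `affine_gap`, `simple_s` and `alignment_score` are assumptions chosen for illustration; they are not the values from the slide's matrix.

```python
def affine_gap(k, alpha=-5, beta=-1):
    """Affine penalty for a gap of length k: opening cost plus extensions."""
    return alpha + beta * (k - 1)


def simple_s(a, b):
    """Toy similarity function: +1 for identical characters, -1 otherwise."""
    return 1 if a == b else -1


def alignment_score(row1, row2, s=simple_s, g=affine_gap, gap="-"):
    """Score = sum of s(a, b) over aligned pairs + sum of g(k) over gap runs."""
    total = 0
    run, run_side = 0, None  # length and row of the current gap run
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            side = 1 if a == gap else 2
            if run and side != run_side:  # gap switches rows: close old run
                total += g(run)
                run = 0
            run_side = side
            run += 1
        else:
            if run:  # a gap run just ended
                total += g(run)
                run = 0
            total += s(a, b)
    if run:  # alignment ends inside a gap run
        total += g(run)
    return total


# X: ATGCTGGGA / Y: ATCCT--GA has 6 matches, 1 substitution, 1 gap of length 2:
# 6*(+1) + 1*(-1) + (-5 + -1*(2-1)) = -1
print(alignment_score("ATGCTGGGA", "ATCCT--GA"))  # -1
```

With a real matrix, `s` would be a lookup into that matrix instead of the toy function used here.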

The meaning of the alignment score

When we perform a summation over all positions of the alignment of sequences X and Y, the total score S(X,Y) gives us an estimate of the likelihood that the alignment represents similarity compared to aligning X and Y at random:

S(X,Y) > 0 – more likely than random (many similar amino acids are aligned)

S(X,Y) < 0 – less likely than random (many dissimilar amino acids are aligned)

S(X,Y) = 0 – random alignment

Dot Plot – the simplest way to visualize pair-wise alignment

Put a dot in the cell (i,j) if the character in row i is the same as the character in column j, A(i) = A(j).

[Figure: dot plots of a palindrome (ATTA) and of an inverted repeat (AT-TA)]
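The rule above fits in a few lines. This is a minimal text-mode sketch (the names `dot_plot` and `render` are made up here); real dot plot tools usually add a sliding window and a score threshold to suppress noise.

```python
def dot_plot(seq_rows, seq_cols):
    """Boolean matrix: cell (i, j) is True when the two characters are equal."""
    return [[a == b for b in seq_cols] for a in seq_rows]


def render(seq_rows, seq_cols):
    """Draw the dot plot as text, '*' for a dot and '.' for an empty cell."""
    lines = ["  " + " ".join(seq_cols)]
    for a, row in zip(seq_rows, dot_plot(seq_rows, seq_cols)):
        lines.append(a + " " + " ".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


# Self-comparison of the palindrome ATTA: besides the main diagonal,
# dots also appear on the anti-diagonal.
print(render("ATTA", "ATTA"))
```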

On-line Java-based tool for computing Dot Plots

• http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

Optimal sequence alignments

We can compute two types of optimal alignments:

• GLOBAL – aligns entire sequence 1 against entire sequence 2

• LOCAL – aligns a region of sequence 1 against a region of sequence 2

Very often local alignment is a better choice, since it finds only similar subsequences and does not attempt to align totally unrelated segments.

[Figure: a local alignment compared with a global alignment that includes a meaningless alignment of unrelated segments]

Optimal local alignment

CCCCCCCCCCCCATATATATACCCCCCCCCCCC

vs.

GGGGATATTTATAGGGGGGGGGGG

ATATATATA
ATATTTATA

We want only this part.
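The optimal local alignment of the two sequences above can be found by dynamic programming (the Smith-Waterman method named later in this workshop). The sketch below is a simplified illustration under stated assumptions: it uses a toy score (+1 match, -1 mismatch) and a linear gap penalty of -2 per position instead of a real matrix with affine gaps, and the function name `smith_waterman` is just a label for this example.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_x, aligned_y) for one optimal local alignment."""
    n, m = len(x), len(y)
    # H[i][j] = best score of a local alignment ending at x[i-1], y[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new alignment
                          H[i - 1][j - 1] + s,  # align x[i-1] with y[j-1]
                          H[i - 1][j] + gap,    # gap in y
                          H[i][j - 1] + gap)    # gap in x
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


score, a, b = smith_waterman("CCCCCCCCCCCCATATATATACCCCCCCCCCCC",
                             "GGGGATATTTATAGGGGGGGGGGG")
print(score)  # 7 (8 matches, 1 mismatch)
print(a)      # ATATATATA
print(b)      # ATATTTATA
```

The zero in the `max` is what makes the alignment local: a poorly scoring prefix is dropped rather than carried along.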

Sub-optimal local alignments

• In order to find all biologically meaningful similarities between two sequences we need to find all high scoring local alignments, not just the optimal one.

• Using a technique called de-clumping, local alignment programs can compute any given number of sub-optimal alignments (2nd best, 3rd best, etc.).

[Figure: Sequence 1 vs. Sequence 2, showing the optimal (best), 2nd best and 3rd best local alignments]

Global alignment with zero end gap penalties

• Used to align a short sequence with a long sequence:

GCGCGCCGCGCGATATATATAGCGCGCCCGCGC

vs.

ATATTTTATATA

GCGCGCCGCGCGATAT---ATATAGCGCGCCCGCGC
------------ATATTTTATATA------------

The end gaps (in the short sequence) are not penalized. The internal gap (in the long sequence) is penalized.

Basic notation

• Percent identity = #identical pairs / alignment length

• Percent similarity = #similar pairs / alignment length (similar pairs are those that receive a positive score)

• Alignment score = Σ(aligned positions) + Σ(gaps)

Example: alignment length = 36, alignment score = 19, percent identity = 16.7%, percent similarity = 52.8%.
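These definitions translate directly into code. The sketch below is illustrative only: the function names are invented, and `score` stands for whatever similarity function or matrix lookup is in use. For reference, the quoted figures of 16.7% and 52.8% are what 6 identical pairs and 19 similar pairs out of 36 columns would give.

```python
def percent_identity(row1, row2, gap="-"):
    """Identical aligned pairs as a percentage of the alignment length."""
    identical = sum(1 for a, b in zip(row1, row2) if a == b and a != gap)
    return 100.0 * identical / len(row1)


def percent_similarity(row1, row2, score, gap="-"):
    """Pairs with a positive score as a percentage of the alignment length."""
    similar = sum(1 for a, b in zip(row1, row2)
                  if gap not in (a, b) and score(a, b) > 0)
    return 100.0 * similar / len(row1)


# A 36-column DNA alignment with 9 identical pairs.
long_row = "GCGCGCCGCGCGATAT---ATATAGCGCGCCCGCGC"
short_row = "------------ATATTTTATATA------------"
print(percent_identity(long_row, short_row))  # 25.0
```

Note that both percentages depend on the alignment length, so end gaps and internal gaps lower them even when every aligned pair is identical.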

Example: global vs. local

Optimal global alignment obtained using BLOSUM50, A=-12, B=-2

Optimal local alignment obtained using BLOSUM50, A=-12, B=-2

Global vs. Local:

• Use global alignment if
– You expect, based on some biological information, that your sequences will match over the entire length.
– Your sequences are of similar length.

• Use local alignment if
– You expect that only certain parts of two sequences will match (as in the case of a conserved segment that can be found in many different proteins).
– Your sequences are very different in length.
– You want to search a sequence database (we will talk about it in detail later).
– NOTE: local alignment works only with similarity matrices because the total score must be able to take +, - and 0 values.

Selecting the right protein scoring matrix

• PAM matrices:
– PAM250 is suitable for highly diverged sequences (>30% identity) and is the best choice for aligning unknown proteins.
– PAM160 is suitable for sequences which are 50-60% identical.
– PAM40 is suitable for very similar proteins (70-90% identity).

• BLOSUM matrices:
– BLOSUM50-62 is suitable for highly diverged sequences (>30% identity) and is the best choice for aligning unknown proteins.
– BLOSUM80 is suitable for the 50% identity level.
– BLOSUM90 is suitable for very similar proteins.

• PAM250 ≈ BLOSUM45, PAM160 ≈ BLOSUM62, PAM120 ≈ BLOSUM80
– Note that an increase in numbering in PAM corresponds to a decrease in BLOSUM numbering and vice versa.

Lowering the gap penalties results in a different optimal alignment with higher % identity.

Low gap penalties -> artificially high percent identity between sequences

Low complexity regions

• Real biological sequences have many regions where one or a few characters are over-represented (so-called low complexity regions):

ATGGPTIVLLVAAAAAAAAAAGPTPGLILW
           |||| |||||
EVVIKPSMCDHAAAATAAAAALCMKFC

• Such regions will bias the alignment because they tend to align with each other. Even absolutely unrelated sequences will have regions of false “similarity”.

• Mask these regions using PSEG for proteins, NSEG or DUST for DNA.
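To give a feel for what masking does, here is a crude sliding-window sketch based on Shannon entropy. It is NOT the algorithm used by PSEG, NSEG or DUST; the window size, the threshold and the function names are assumptions picked for illustration, and real searches should rely on the dedicated programs.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the character composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=10, threshold=1.0, mask_char="X"):
    """Replace every window whose entropy is <= threshold with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


# The poly-A stretch is masked; with these settings a couple of flanking
# residues on each side are masked as well.
print(mask_low_complexity("ATGGPTIVLLVAAAAAAAAAAGPTPGLILW"))
```

A window made of a single repeated character has entropy 0, while a window of ten different residues has entropy log2(10), about 3.3 bits, so the threshold separates the two extremes.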

Masking low complexity regions in proteins (PSEG program)

General suggestions for doing pairwise alignment

1. Always translate protein-coding DNA sequences – protein sequences contain much more information.

2. Make a pairwise dotplot and a dotplot of each sequence with itself.

• Look for low complexity regions and repeats.

• Mask low complexity regions, remove repeats.

3. Do a local alignment first to see whether your sequences are similar over their entire lengths.

• If they are or if you are absolutely positive that despite the lack of apparent sequence similarity your sequences can be aligned globally, use global alignment.

• ALWAYS perform a visual inspection.


4. Use the default gap penalties supplied by the program first. The choice of gap penalties depends on the matrix.

5. Some matrices are good for distantly related sequences, some are good for closely related sequences.

6. Do alignments with different matrices and gap penalties. Regions that produce consistent alignments usually can be trusted.

7. Usually there is more than one optimal alignment.

8. The optimal alignment found by an alignment program is not necessarily the correct one from the biological point of view.

• One of the nearly-optimal alignments may be what you need. You may need to adjust the optimal alignment manually according to your biological “hunch”.

Pairwise alignment programs

1. Stand-alone programs from FASTA package http://fasta.bioch.virginia.edu/

• ALIGN – global alignment (affine gap penalties).

• ALIGN0 – global alignment with 0 end gap penalties (affine gap penalties).

• LALIGN – n best scoring local alignments (affine gap penalties).

2. On-line alignment programs available from KU Bioinformatics: http://jay.bioinformatics.ku.edu/EMBOSS/index.html

• NEEDLE (global alignment with affine gap penalties)

• WATER (local alignment with affine gap penalties)

• DotPlot

3. All possible types of alignments on one web-server at USC:

• http://www-hto.usc.edu/software/seqaln/seqaln-query.html

Application of DP to search sequence databases

Objective of sequence database searches:

• Given a DNA or protein sequence, find all sequences in the database that are similar to this sequence.

• Main problem is how to pick all similar sequences and filter out all dissimilar sequences. Final result always depends on the definition of similarity.

Some terminology used in sequence searching

• Query sequence – the sequence of interest which is compared to the database sequences.

• Significance threshold – the critical value of the variable (alignment score, probability, etc) used to assess the alignment between query sequence and a database sequence and to draw conclusions about similarity between them.

• Hit – an alignment between query sequence and database sequence that scores above the significance threshold.

SEARCHING SEQUENCE DATABASES

• OBJECTIVE: given a query sequence, find all database sequences that are significantly similar to the query sequence.

• IMPLEMENTATION:
– Align the query sequence to each database sequence using local alignment (N best alignments in total).
– Keep only those alignments that score higher than a certain significance threshold.
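The implementation outlined above can be sketched as a loop over the database. Everything here is a toy under stated assumptions: the scoring values (+1, -1, linear gap -2), the raw-score threshold, the sequence names and the function names are all made up for illustration, and only the single best local score per database sequence is kept. Real programs assess hits with statistics rather than a raw score cut-off.

```python
def local_score(x, y, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of x vs. y (score only, no traceback)."""
    prev = [0] * (len(y) + 1)
    best = 0
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            s = match if a == b else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def search(query, database, threshold):
    """Return (name, score) hits scoring above the threshold, best first."""
    scored = [(name, local_score(query, seq)) for name, seq in database.items()]
    hits = [(name, sc) for name, sc in scored if sc > threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


toy_database = {          # hypothetical entries
    "seq1": "GGGGATATTTATAGGGG",
    "seq2": "CCCCCCCC",
    "seq3": "TTATATATATT",
}
print(search("ATATATATA", toy_database, threshold=5))
# [('seq3', 8), ('seq1', 7)]
```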

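[Figure: database sequences compared with the query; those below the significance threshold are crossed out (X), those above it are marked as hits]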

E-value

• The significance of observed similarity between query and database sequence is assessed using the E-value.

• E-value is a normalized statistic that gives the number of matches with a score equal to or greater than the observed score, x, that are expected to occur by chance when searching a database of given size N.

• The lower the E-value, the higher the chance that similarity between query and database sequence is significant.

• However, similarities we deem “significant” depend on an arbitrary choice of E-value.

The twilight zone

• When percent identity between two aligned sequences drops below 25-30%, the estimates of statistical significance fail to distinguish between related and unrelated sequences. In other words, true similarity cannot be distinguished from a random match.

• This sequence identity threshold is usually referred to as the twilight zone of pairwise sequence alignment. This is the green area of the overlap we saw previously.

• Proteins with similar structure/function that have pairwise sequence identity below 25-30% can score lower than structurally and functionally dissimilar proteins.

Mathematically Rigorous Search: Smith-Waterman Method

1. SSEARCH – a stand-alone program from the FASTA package http://fasta.bioch.virginia.edu/

2. On-line alignment available from EMBL-EBI: http://www.ebi.ac.uk/MPsrch/

• SW search with column-cost and affine gap penalties

• SW is the most rigorous method to do a database search using a single query sequence. It is also the slowest one.

• Heuristic search methods are much faster: BLAST – Basic Local Alignment Search Tool.

The main idea of BLAST

Protein BLAST requires two neighboring hits of length W (W=3 by default).

[Figure: query sequence and database sequence with two neighboring hits; labels: hit 1, hit 2, A, W=3, W=3]

Nucleotide BLAST looks for one exact match of length W (W=11 by default).

ATCGCCATGCTTAATTGGGCTT
     CATGCTTAATT          exact match

An alignment that BLAST can’t find using default parameters because there is no hit of length 11

1   GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
1   GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

61  GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
61  GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
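The seeding step of nucleotide BLAST can be illustrated with a word index. The sketch below covers only the exact-word lookup, not the extension or statistics that follow it, and the function names are invented for this example. The second function measures the longest run of identical positions along an ungapped alignment, which shows why the alignment above yields no seed of length 11 along its diagonal.

```python
def word_hits(query, subject, w=11):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def longest_identical_run(row1, row2):
    """Longest stretch of consecutive identical columns in an ungapped alignment."""
    best = run = 0
    for a, b in zip(row1, row2):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best


print(word_hits("ATCGCCATGCTTAATTGGGCTT", "CATGCTTAATT"))  # [(5, 0)]

# First 60 columns of the alignment that default BLAST misses:
top = "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
bottom = "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
print(longest_identical_run(top, bottom))  # 6, well short of W=11
```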

Significance of BLAST hits

• Statistically significant does not always imply “biologically meaningful”.

• No strict rule how to choose E-value. Always a trade-off between sensitivity and specificity.

• E-value < 10^-6 – almost surely homologous. Will miss remote homologues.

• Between 10^-6 and 10^-2 – probably homologous.

• Between 10^-2 and 10 – might or might not be interesting…

• In any case, use your own judgment.

• Pairwise sequence comparison cannot detect remote homologues reliably when sequence identity drops below 25-30%. Need more sophisticated approaches.

Protein databases for BLAST

DNA databases for BLAST

Important points to keep in mind when doing a sequence search

• Always translate cDNA sequences – protein sequences are much more informative (4 characters in DNA vs. 20 characters in proteins). Protein scoring matrices are also more informative.

• Filter low complexity regions (both DNA and proteins).

• Mask interspersed repeats in DNA using RepeatMasker.

• Proteins: start with BLOSUM62, repeat the search with BLOSUM45.

• DO NOT change default gap penalties!

[Flowchart, starting from the query sequence:]

1. BLAST search with BLOSUM62, BLOSUM45.

2. No similarity: Smith-Waterman search with BLOSUM62, BLOSUM45.

3. No similarity: search against profiles and HMMs.

4. No similarity = trouble.

Multiple Sequence Alignment

What is Multiple Sequence Alignment (MSA)?

• Alignment of three or more sequences.

• Each column contains characters that are equivalent across all sequences.

• Put a gap if a sequence has no equivalent character in a given column:

Two problems with MSA

1. There is no good way to score multiple sequence alignments.

2. Optimal answer cannot be found for more than 6-7 sequences because of extremely high computational complexity.

– MSA is performed using heuristic methods that DO NOT guarantee an optimal solution.

– Progressive MSA is the most widely used method.

[Flowchart of progressive MSA:]

1. Alignment of each sequence pair (i,j)

2. Distance for each sequence pair, d(i,j)

3. Guide tree

4. Alignment using the guide tree
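A guide tree can be built from the pairwise distances d(i,j) by repeatedly joining the two closest clusters. The sketch below uses average-linkage clustering, which is an assumption made for illustration (different programs use different tree-building methods); the distances and sequence names are hypothetical, and the function names are invented.

```python
from itertools import combinations


def average_distance(leaves1, leaves2, dist):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in leaves1 for b in leaves2)
    return total / (len(leaves1) * len(leaves2))


def guide_tree(names, dist):
    """Join the two closest clusters until one tree (nested tuples) remains."""
    clusters = [(name, [name]) for name in names]  # (subtree, leaf names)
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_distance(clusters[p[0]][1],
                                                  clusters[p[1]][1], dist))
        (tree1, leaves1), (tree2, leaves2) = clusters[i], clusters[j]
        merged = ((tree1, tree2), leaves1 + leaves2)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical distances, e.g. 1 - fractional identity of each pairwise alignment.
d = {frozenset(pair): value for pair, value in [
    (("A", "B"), 0.10), (("C", "D"), 0.20), (("A", "C"), 0.60),
    (("A", "D"), 0.70), (("B", "C"), 0.65), (("B", "D"), 0.75)]}
print(guide_tree(["A", "B", "C", "D"], d))  # (('A', 'B'), ('C', 'D'))
```

Read from the leaves inward, the tree gives the order of the progressive alignment: A with B, C with D, then the two resulting alignments with each other.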

Once a gap, always a gap

• A tricky part of progressive MSA is that we have to align two alignments. Scoring gaps is the biggest problem.

• To simplify the situation, gaps introduced at earlier stages of MSA cannot be changed as new sequences are added to the alignment.

Problems with progressive MSA

• A heuristic solution, no guarantee of optimal alignment.

• The initial guide tree is not reliable and does not always give the correct relationship between aligned sequences.

• Alignment errors made at initial stages of MSA cannot be corrected and accumulate as more sequences are added to the alignment.

• Progressive alignment becomes very unreliable for very diverged sequences (below 30% identity).

• Only global alignment is used (attempts to align all parts of the sequences). Does not work well if sequences have very different lengths.

Programs for progressive MSA

• PILEUP – commercial, from the GCG package.

• CLUSTAL – freeware, can be used both on-line (http://www.ebi.ac.uk/clustalw/) and installed locally on Windows OS.

• CLUSTAL has certain advantages over PILEUP:
– Weighting of sequences (removes the effect of over-represented similar sequences)
– Empirical position-specific gap penalties (lower weights in stretches of amino acids that may be variable loop regions)
– Can align two user-supplied multiple alignments or align one sequence with an existing MSA.

MSA should ALWAYS be visually inspected and, in most cases, edited manually

MSA editors:

• GeneDoc (WINDOWS, http://www.psc.edu/biomed/genedoc/)
– Can be used to color MSA according to the secondary structure of one sequence (if structure is known) or various a.a. properties
– Provides reports on a.a. composition, pairwise scores

• BioEdit (WINDOWS, http://www.mbio.ncsu.edu/BioEdit/bioedit.html)
– Has a lot of features. Can be integrated with accessory applications and run CLUSTAL, BLAST, tree construction programs

• CINEMA (Java applet, run from web-browser, no installation, http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/)

• MACAW (WINDOWS, Mac, current NCBI link is broken)

BioEdit

• Has a variety of options for both DNA and protein sequences: Search, Translate, Reverse, DotPlot, Edit and Shade MSA, etc.

• Works with a variety of formats including FASTA and Clustal

• BEST feature: capable of handling huge datasets

GeneDoc

• Edit and Shade MSA, etc.

• Works with a variety of formats including FASTA and Clustal.

• BEST feature: can shade protein sequences using a secondary structure mask from DSSP.

MACAW

• Old program. BEST feature: can construct MSA using conserved blocks found with LOCAL pair-wise sequence alignment and Gibbs sampler.

Incorporation of local alignment features into MSA

• T-Coffee is an “add-on” program that interacts with CLUSTAL and uses both local and global alignment information to improve the results of MSA.

• Has been shown to improve the accuracy of CLUSTAL in some cases when sequences with low identity are being aligned.

Available on-line at: http://www.ch.embnet.org/software/TCoffee.html

Very time-consuming. Can be installed locally (requires CLUSTAL). Local version is available at KU Bioinformatics server.

T-Coffee

Main features:

• Combines alignment information from heterogeneous sources, including pair-wise local alignments, into a library.

• Uses the library to perform a progressive alignment.

Alignment of cDNA in codon-to-codon mode

Synonymous substitution

http://jay.bioinformatics.ku.edu/~codon/

The easiest way to use multiple sequences to search for remote homologues: Position Specific Iterative BLAST (PSI-BLAST)

• Uses MSA in the form of Position Specific Scoring Matrix (PSSM) to detect remote homologues. Works for proteins only.

• Unlike standard BLAST, database sequences are aligned with the PSSM, not with the initial query sequence.

• PSI-BLAST is CONSIDERABLY more sensitive in finding divergent members of the same protein family than regular BLAST.

• The easiest way to use MSA for sequence search in completely automated mode.

Position Specific Scoring Matrix (PSSM)

[Figure: a PSSM derived from a Multiple Sequence Alignment]
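A PSSM can be thought of as one column of scores per alignment position. The sketch below builds log-odds scores from column counts with a flat pseudocount and uniform background frequencies; these are simplifying assumptions, since PSI-BLAST itself also weights sequences and uses more elaborate pseudocounts. A DNA alphabet and a made-up four-sequence alignment keep the example short, and the function names are invented.

```python
import math
from collections import Counter


def build_pssm(msa, alphabet, background=None, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column."""
    if background is None:  # assume uniform background frequencies
        background = {c: 1.0 / len(alphabet) for c in alphabet}
    pssm = []
    for col in range(len(msa[0])):
        counts = Counter(row[col] for row in msa if row[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({c: math.log2((counts[c] + pseudocount) / total / background[c])
                     for c in alphabet})
    return pssm


def pssm_score(pssm, segment):
    """Score an ungapped segment of the same length as the PSSM."""
    return sum(column[c] for column, c in zip(pssm, segment))


toy_msa = ["ACGT", "ACGA", "ACCT", "TCGT"]  # hypothetical alignment
pssm = build_pssm(toy_msa, "ACGT")
print(pssm_score(pssm, "ACGT") > pssm_score(pssm, "TTTT"))  # True
```

Unlike a fixed matrix such as BLOSUM62, the score for a given residue here depends on the column, so conserved positions count for more than variable ones.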

PSI-BLAST steps

[Flowchart:]

1. Query sequence

2. Standard BLASTP search of the protein database using a standard amino acid scoring matrix (e.g., BLOSUM62)

3. Filter sequences by inclusion E-value; filter sequences manually

4. Multiple sequence alignment

5. PSSM

6. BLAST search of the protein database using the PSSM

7. Output (sequence hits to PSSM)

Example of how PSI-BLAST can be used to detect remote sequence similarities: 1AUX (synapsin) and 1GLV have only 17.5% sequence identity. Pairwise search methods fail.

Use the sequence of 1AUX to run a PSI-BLAST search.

Sequences with E-value below this threshold will be used to construct the PSSM.

1st iteration is a regular BLAST search.

2.5 times more hits

PSI-BLAST iteration 2

4 times more hits

15 times more hits

New hits with known structure. Appeared only on the 4th iteration. One of them is 1GLV.

[Figure: structures of 1GLV and 1AUX]

Synapsin 1AUX and Glutathione synthetase 1GLV have very similar structures

A word of caution on using PSI-BLAST

• Avoid including very close sequences in order not to over-fit.

• Unrelated sequences may be included. Always check sequences on each iteration using biological knowledge.

• Results depend on the choice of initial query sequence. Some query sequences will produce considerably better results.

• Powerful, but results may be misleading if too many irrelevant sequences were included.

• E-value should be treated very carefully.

De-novo motif detection

• MOTIF – a conserved region shared by many related sequences. These sequences can be functionally related, structurally related, or both.
– Examples: TF-binding sites in upstream regions of co-regulated genes.

Procedure:

• Take a set of sequences that are presumed to contain at least one unknown common motif.

• Try to find this motif by comparing all similar segments found in these sequences and using statistics to filter out random similarities.

Searching for ungapped LOCAL MSAs

AAAAAAAAAAAGGGCGGAAAAA

TTTTTTTTTTTTTTTTTTTTTTGGGGGGTTTT

CCGTGGGGCCCAAAAA

AAAAACTGGGGGGCTCTCTCTCTCTCTCTCTC


TTTTTTTTTTTTTTTTTTTTTTGTGGGGTTTT

CCGGGGGGCCCAAAAA



CCGTGGGGCCCAAAAA


+

1ST motif

2nd motif

MEME and MAST

• MEME (Multiple Em for Motif Elicitation) – finds one or more ungapped motifs in a set of DNA or protein sequences. Analyses all possible motif sizes and locations. http://meme.sdsc.edu/meme/website/intro.html Also available at the KU Bioinformatics server.

• MAST (Motif Alignment and Search Tool) – uses motifs detected by MEME to search sequence databases (the same web site).

• Advantages: the most user-friendly program, supported, runs on multiple processors.

• Disadvantages: not the most sensitive program, misses weak motifs. Time-consuming.

MEME

• The on-line version does not allow changing all arguments.

• Main program arguments:

– OOPS (One Occurrence Per Sequence) – each sequence contains exactly one occurrence of a particular motif. Most sensitive to weak motifs. However, if some sequences do not contain the motif, it will be “blurry”.

– ZOOPS (Zero or One Occurrence Per Sequence) – each sequence contains one or zero occurrences of a particular motif. May miss weak motifs, but will not include irrelevant sites.

– TCM – any number of motifs per sequence. Will find repeated motifs within each sequence.

– Number of motifs to find

– E-value threshold – the program finds motifs with the lowest E-value first and stops when the E-value goes above this threshold.
