sequence alignment techniques


Post on 31-Dec-2015






  • Sequence Alignment Techniques

  • In this presentation: Part 1, Searching for Sequence Similarity; Part 2, Multiple Sequence Alignment

  • Part 1: Searching for Sequence Similarity

  • Sequence similarity searches: Sequence similarity searches of a database enable us to extract sequences that are similar to a query sequence. Information about these extracted sequences can be used to predict the structure or function of the query sequence. Prediction using similarity is a powerful and ubiquitous idea in bioinformatics. The underlying reason for this is molecular evolution.

  • Sequence alignment: Any pair of DNA sequences will show some degree of similarity. Sequence alignment is the first step in quantifying this similarity in order to distinguish between chance similarity and real biological relationships. Alignments show the differences between sequences as changes (mutations), insertions or deletions (indels or gaps), and can be interpreted in evolutionary terms.

  • Alignment algorithms: Dynamic programming algorithms can calculate the best alignment of two sequences. Well-known variants are the Smith-Waterman algorithm (local alignments) and the Needleman-Wunsch algorithm (global alignments). Local alignments are useful when sequences are not related over their full lengths, e.g., proteins sharing only certain domains or DNA sequences related only in exons.
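
The two dynamic programming variants named above differ only in how the score table is initialised and whether scores may drop below zero. The following is a minimal sketch that returns the best score only (no traceback). The function name `align_score` and the match, mismatch and gap values are illustrative assumptions, not values taken from the slides.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best pairwise alignment score by dynamic programming.

    local=False follows the Needleman-Wunsch (global) recurrence;
    local=True follows the Smith-Waterman (local) recurrence.
    Gaps are scored proportionally (one `gap` per gapped position).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps must be paid for.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)

    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else score[-1][-1]
```

With these assumed scores, a local comparison of `"TTTACGTTTT"` and `"GGACGGG"` picks out the shared `ACG` core, whereas a global comparison of the same pair is penalised for the unrelated flanks.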

  • Alignment scores and gap penalties: A simple alignment score measures the number or proportion of identically matching residues. Gap penalties are subtracted from such scores to ensure that alignment algorithms produce biologically sensible alignments without many gaps. Gap penalties may be constant (independent of the length of the gap), proportional (proportional to the length of the gap) or affine (containing gap opening and gap extension contributions). Gap penalties can be varied according to the desired application.
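
The three gap penalty schemes can be written as one small function. This is a sketch under assumptions: the name `gap_penalty`, the default costs, and the affine convention used here (opening cost for the first gap position, extension cost for each further position) are illustrative; other tools charge the extension cost for every position including the first.

```python
def gap_penalty(length, scheme="affine", open_cost=10, extend_cost=1):
    """Penalty (a positive number to subtract) for a gap of `length` positions."""
    if length <= 0:
        return 0
    if scheme == "constant":
        # Same cost regardless of gap length.
        return open_cost
    if scheme == "proportional":
        # Cost grows linearly with gap length.
        return extend_cost * length
    if scheme == "affine":
        # Opening cost once, then a smaller cost per additional position.
        return open_cost + extend_cost * (length - 1)
    raise ValueError(f"unknown gap penalty scheme: {scheme}")
```

For a gap of five positions, the assumed defaults give 10 (constant), 5 (proportional) and 14 (affine).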

  • Similarity and homology: Similarity may exist between any sequences. Sequences are homologous only if they have evolved from a common ancestor. Homologous sequences often have similar biological functions (orthologs), but the mechanism of gene duplication allows homologous sequences to evolve different functions (paralogs).

  • Similarity search in databases: Sequences similar to a query can be found in a database by aligning the query to each database sequence in turn and returning the highest scoring (most similar) sequences. This can be achieved by dynamic programming algorithms, but in practice faster approximate methods are often used.

  • Statistical scores: The p value of a similarity score is the probability of obtaining a score at least as high in a chance similarity between two unrelated sequences of similar composition. Low p values indicate significant matches that are likely to have real biological significance. The related E value is the expected frequency of chance occurrences scoring at least as high as the identified similarity. A low p value for a similarity between two sequences can translate into a high E value for a search of a large database.
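
The last point above is easiest to see with numbers. The sketch below assumes that each database comparison is an independent trial with the same pairwise p value, so the expected number of chance hits is roughly p times the number of database sequences; real search tools use more careful length-dependent statistics. The conversion `p = 1 - exp(-E)` is the Poisson relationship used in BLAST-style statistics. Both function names are illustrative.

```python
import math


def expected_chance_hits(p_pairwise, database_size):
    """Rough E value: expected number of chance hits across the whole database.

    Assumes independent comparisons that all share the same pairwise p value.
    """
    return p_pairwise * database_size


def p_from_e(e_value):
    """Probability of at least one chance hit, given the expected number E."""
    return 1.0 - math.exp(-e_value)
```

Under these assumptions, a pairwise p value of 0.0001 looks convincing for a single comparison, but against a database of one million sequences it corresponds to about 100 expected chance hits.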

  • Sensitivity and specificity: These measures quantify the success of a database search strategy. Sensitivity measures the proportion of real biological sequence relationships in the database that were detected as hits in the search. Specificity is the proportion of the hits corresponding to real biological relationships. Changing E and p value thresholds results in a trade-off between these complementary measures of success.
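
Both measures can be computed directly from a hit list when the true relatives are known, for example in a benchmark. This sketch follows the definitions given above; note that specificity as defined here (the fraction of hits that are real) is what is often called precision elsewhere. The function name and identifiers are hypothetical.

```python
def search_quality(hits, true_relatives):
    """Return (sensitivity, specificity) for one database search.

    hits:           identifiers reported by the search
    true_relatives: identifiers of all real relatives present in the database
    """
    hits = set(hits)
    true_relatives = set(true_relatives)
    true_positives = hits & true_relatives

    # Fraction of real relationships that the search detected.
    sensitivity = len(true_positives) / len(true_relatives) if true_relatives else 0.0
    # Fraction of reported hits that are real relationships.
    specificity = len(true_positives) / len(hits) if hits else 0.0
    return sensitivity, specificity
```

Loosening the E value threshold usually adds hits, which can raise sensitivity while lowering specificity.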

  • Maximizing amino acid identities: Protein sequences can be aligned to maximize amino acid identities, but this will not reveal distant evolutionary relationships.

  • Evolution: Protein-coding sequences evolve slowly compared with most other parts of the genome, because of the need to maintain protein structure and function. An exception to this is the fast evolution that might occur in the redundant copy of a recently duplicated gene.

  • Allowed changes: Changes in protein sequences during evolution tend to involve substitutions between amino acids with similar properties, because these tend to maintain the structural stability of the protein.

  • Substitution score matrices: These matrices give scores for all possible amino acid substitutions during evolution. Higher scores indicate more likely substitutions. Example matrices are BLOSUM62 and PAM250. PAM is derived from "accepted point mutation", and for PAM250 the evolutionary distance of the matrix is 250 amino acid changes per 100 residues. Dynamic programming algorithms for sequence alignment can operate using scores from these matrices.
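
Scoring an existing alignment with a substitution matrix is a lookup per column. The sketch below uses a tiny table, `TOY_SCORES`, whose numbers are made up for illustration and are not taken from BLOSUM62 or PAM250; the identity score, default mismatch score and gap score are likewise assumptions.

```python
# Illustrative scores only: conservative substitutions score higher.
TOY_SCORES = {
    ("K", "R"): 3, ("D", "E"): 3, ("F", "Y"): 7,
    ("I", "L"): 2, ("I", "V"): 4, ("L", "V"): 2,
}


def pair_score(x, y, scores=TOY_SCORES, identity=5, default=-2):
    """Score for aligning residue x with residue y (order does not matter)."""
    if x == y:
        return identity
    return scores.get((x, y), scores.get((y, x), default))


def alignment_score(row1, row2, scores=TOY_SCORES, gap=-8):
    """Sum column scores of two aligned rows of equal length ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        total += gap if "-" in (x, y) else pair_score(x, y, scores)
    return total
```

With an identity-only scheme, `KDF` against `REY` would score as three mismatches; with the toy table the three conservative pairs (K/R, D/E, F/Y) all score positively.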

  • Significance of score matrices: Substitution score matrices allow detection of distant evolutionary relationships between protein sequences. It is possible to detect much more distant relationships by comparing protein sequences than by comparing nucleic acid sequences.


  • A dot plot of the human pleckstrin sequence against itself, produced with Erik Sonnhammer's Dotter program. The sequence is plotted from N- to C-terminus along the horizontal and vertical axes, between residues 1 and approximately 350. PLEK_HUMAN (horizontal) vs. PLEK_HUMAN (vertical)
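
The idea behind a dot plot can be sketched in a few lines: place a dot wherever a short window of one sequence matches a window of the other. This is a simple identity-based version, not the scoring scheme Dotter itself uses; the function names, `window` and `min_matches` are illustrative assumptions.

```python
def dot_plot(seq_x, seq_y, window=1, min_matches=1):
    """Return the set of (i, j) positions where the windows starting at
    seq_x[i] and seq_y[j] share at least `min_matches` identical residues."""
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(dots, width, height):
    """Draw the plot as text rows: seq_x runs horizontally, seq_y vertically."""
    return [
        "".join("*" if (i, j) in dots else "." for i in range(width))
        for j in range(height)
    ]
```

When a sequence is compared with itself, the main diagonal is always filled, and internal repeats show up as additional diagonals parallel to it.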

  • The PAM250 matrix and alignment of sequences. Total alignment scores for the two matrices should not be compared, but note that the PAM matrix is able to detect a much better alignment in the second halves of these sequences than the identity matrix. With the introduction of a single gap, sensible alignments of hydrophobic amino acids, and alignment of K with R (both basic), D with E (both acidic) and F with Y (both aromatic), can be seen. [Figure: lower-triangular PAM250 score matrix over the amino acids C S T P A G N D E Q H R K M I L V F Y W; the matrix values were garbled in extraction and are not reproduced here.] Sequence 1: MIIVKP VVLKGDFG. Sequence 2: MILLKP AIIIRAEY-. Position score: 656256 044231370

  • Figure 3. Display of the DNA unit. DNA can be described at several levels of detail. At the most detailed level, DNA can be characterized by the 5' and 3' termini at both external and internal positions; at the most abstract level, the substrate DNA can be one of 16 common structures. The goal is to provide methods for specifying the properties of DNA in as many ways as is natural for a scientist.

  • Figure 7. An initial experimental environment. The temperature is 37 degrees Celsius and the pH value is 7.4. No DNA polymerase I activity is possible

  • Part 2: Multiple Sequence Alignment

  • Non-specific sequence similarity: Certain types of sequence similarity are less likely to be indicative of an evolutionary relationship than others are. Examples of this are similarity between regions of low compositional complexity, short period repeats and protein sequences coding for generic structures like coiled coils.

  • Similarity search filters: Regions of the non-specific sequence types can degrade the results of similarity searches and are often filtered out of query sequences prior to searching. The programs SEG and DUST can be used to detect and filter low complexity sequences, XNU can filter short period repeats and COILS can detect the presence of potential coiled coil structures.
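
To illustrate what a low complexity filter does, the sketch below masks windows whose Shannon entropy falls below a threshold. This is a simplified stand-in and not the actual SEG or DUST algorithm; the window size, threshold and mask character are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Compositional entropy of a window, in bits per residue."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every residue that lies in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)
```

A run such as twelve consecutive serines has zero entropy and is masked, while a window of twelve different residues has entropy of about 3.6 bits and is left alone. Residues flanking a masked run may also be masked, because overlapping windows share the low complexity.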

  • Database types for searches: Database and query sequences can be protein or nucleic acid sequences, and different query strategies are required for different types and combinations. In general, searches are more sensitive using strategies where protein-coding nucleic acid database and/or query sequences are first translated to protein sequences.
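
Translating a nucleic acid sequence before searching usually means producing all six reading frames, since the coding frame and strand are not known in advance. This is a minimal sketch using the standard genetic code; the function names are illustrative, and codons containing characters other than A, C, G, T are rendered as `X` by assumption.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame(dna):
    """Translations of all six reading frames, keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six translations can then be compared against a protein database, which is the general strategy behind translated searches.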

  • Iterative database searches: PSI-BLAST is an iterative search method that improves on the detection rate of BLAST and FASTA. Each iteration discovers intermediate sequences that are used in a sequence profile to discover more distant relatives of the query sequence in subsequent iterations. Potential problems with PSI-BLAST are associated with the potential for unrelated sequences to pollute the iterative search, and difficulties associated with the domain structure of proteins. PSI-BLAST often detects up to twice as many evolutionary relationships as BLAST.
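
The sequence profile at the heart of an iterative search can be pictured as a table of position-specific scores built from the hits of the previous round. The sketch below is a deliberately simplified illustration, not PSI-BLAST's actual procedure: it assumes gap-free aligned hits of equal length, a uniform background frequency, and a fixed pseudocount, and it omits sequence weighting entirely.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_hits, pseudocount=1.0):
    """Per-column log-odds scores (bits) from aligned sequences of equal length."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
    profile = []
    for column in zip(*aligned_hits):
        counts = Counter(r for r in column if r in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score_against_profile(seq, profile):
    """Sum the position-specific scores of seq, column by column."""
    return sum(column.get(r, 0.0) for r, column in zip(seq, profile))
```

In an iterative scheme, sequences scoring well against the profile would be added to the alignment and the profile rebuilt. This is also where pollution arises: one unrelated sequence admitted in an early round shifts the scores used in every later round.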

  • Multiple sequence alignment: Multiple alignment illustrates relationships between two or more sequences. When the sequences involved are diverse, the conserved residues are often key residues associated with maintenance of structural stability or biological function. Multiple alignments can reveal many clues about protein structure and function.

  • Multiple alignment: Part of an (artificial) multiple alignment of a family consisting of 7 sequences, which subdivide into 3 subfamilies. The bars on the left indicate subfamilies; the dotted boxes highlight conservation patterns.

  • Progressive sequence alignment: Most commonly used software uses the method of progressive alignment. This is a fast method, but frozen-in errors mean that it does not always work perfectly. Biological knowledge can provide inf
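
Conservation patterns like those highlighted in such a figure can be estimated column by column. The sketch below reports, for each column, the most common residue and the fraction of sequences that carry it. The function name and the example alignment are hypothetical, and real tools typically use substitution-aware conservation scores rather than plain identity.

```python
from collections import Counter


def column_conservation(alignment):
    """For each column, return (most common residue, fraction of rows carrying it).

    `alignment` is a list of equal-length rows; '-' marks a gap and is never
    reported as the conserved residue, but gapped rows still count in the
    denominator.
    """
    result = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(column)))
    return result
```

For the hypothetical rows `MKV-`, `MRVA` and `MKIA`, the first column is fully conserved (M in all three rows), while the remaining columns are each conserved in two of three rows.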