pairwise sequence alignments

Pairwise Sequence Alignments

Bioinformatics

Some Bioinformatics Programming Terminology

Model

• A model is a set of propositions or equations describing in simplified form some aspects of experience.

• A valid model includes all essential elements and their interactions of the concept or system it describes.

Algorithm

• An algorithm is a complete, unambiguous procedure for solving a specified problem in a finite number of steps.

• Algorithms leave nothing undefined and require no intuition to achieve their end.

Five Features of an Algorithm:

• An algorithm must stop after a finite number of steps.

• All steps of the algorithm must be precisely defined.

• Input to the algorithm must be specified. • Output of the algorithm must be specified.

There must be at least one output. • An algorithm must be effective - i.e. its

operations must be basic and doable.

Data Structures: Foundation of an Algorithm

• One of the most important choices of writing a program.

• For the same operation, different data structures can lead to vastly more or less efficient algorithms.

• The design of data structures and algorithms goes hand in hand.

• Once the data structure is well defined, usually the algorithm can be simple.

Data Structure Primitives:

• Strings • Arrays

A String Is a Linear Sequence of Characters.

• This implies several important properties:

• Finite strings have beginnings and ends. Thus they also have a length.

• Strings imply an alphabet. • The elements of a strings are ordered.• A string is a one dimensional array.

Two Dimensional Arrays

Pairwise Sequence Alignment is Fundamental to Bioinformatics

• It is used to decide if two proteins (or genes) are related structurally or functionally

• Two Dimensional Arrays are the basis of Pairwise Alignments

• It is used to identify domains or motifs that are shared between proteins

• It is the basis of BLAST searching • It is used in the analysis of genomes

Are there other sequences like this one?

• Huge public databases - GenBank, Swissprot, etc.

• Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes

• Similarity searching is based on alignment between two strings in a 2-D array

Why Search for Similarity?

1. I have just sequenced something. What is known about the thing I sequenced?

2. I have a unique sequence. Is there similarity to another gene that has a known function?

3. I found a new protein in a lower organism. Is it similar to a protein from another species?

4. I have decided to work on a new gene. The people in the field will not give me the plasmid. I need the complete cDNA sequence to perform RT-PCR of some other experiment.

Definitions• Similarity: The extent to which nucleotide

or protein sequences are related. It is based upon identity plus conservation.

• Identity: The extent to which two sequences are invariant.

• Conservation: Changes at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.

RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 + K ++ + + + GTW++ MA + L + A V T + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEI V LHRWEN 81

Definitions

• Identical - When a corresponding character is shared between two species or populations, that character is said to be identical.

• Similar - The degree to which two species or populations share identities.

• Homologous - When characters are similar due to common ancestry, they are homologous.

Evolution and Alignment• Homology - two (or more) sequences

have a common ancestor. • This is a statement about evolutionary

history. • Similarity - two sequences are similar,

by some criterion. • It does not refer to any historical process,

just to a comparison of the sequences by some method.

• It is a logically weaker statement.

Caution

• In bioinformatics these two terms are often confused and used interchangeably.

• The reason is probably that significant similarity is such a strong argument for homology.

Similarity ≠ Homology

1) 25% similarity ≥ 100 AAs is strong evidence for homology

2) Since homology is an evolutionary statement, there should be additional evidence which indicates “descent from a common ancestor”

– common 3D structure– usually common function

3) Homology is all or nothing. You cannot say "50% homologous"

retinol-binding protein(NP_006735)

b-lactoglobulin(P02754)

Page 42

Definitions, Con’t.• Analogous - Characters are similar due to

convergent evolution. • Orthologous - Homologous sequences (or

characters) between different species that descended from a common ancestral gene during speciation; They may or may not be responsible for a similar function.

• Paralogous - Homologous sequences within a single species that arose by gene duplication.

• Homology is therefore NOT synonymous with similarity.

• Homology is a judgment, similarity is a measurement.

Proteins or Genes Related by Evolution Share a Common

Ancestor• Random mutations in the sequences

accumulate over time, so that proteins or genes that have a common ancestor far back in time are not as similar as proteins or genes that diverged from each other more recently.

• Analysis of evolutionary relationships between protein or gene sequences depends critically on sequence alignments.

Function is Conserved

• Alignments can reveal which parts of the sequences are likely to be important for the function, if the proteins are involved in similar processes.

• In parts of the sequence of a protein which are not very critical for its function, random mutations can easily accumulate.

• In parts of the sequence that are critical for the function of the protein, hardly any mutations will be accepted; nearly all changes in such regions will destroy the function.

Sequence Alignments

• Comparing sequences provides information as to which genes have the same function

• Sequences are compared by aligning them – sliding them along each other to find the most matches with a few gaps

• An alignment can be scored – count matches, and can penalize mismatches and gaps

• It is much easier to align proteins. Why?

Why Search with Protein, not DNA Sequences?

1) 4 DNA bases vs. 20 amino acids - less chance similarity

2) can have varying degrees of similarity between different AAs- # of mutations, chemical similarity, PAM

matrix

3) protein databanks are much smaller than DNA databanks

Similarity is Based on Dot Plots

1) two sequences on vertical and horizontal axes of graph

2) put dots wherever there is a match

3) diagonal line is region of identity (local alignment)

4) apply a window filter - look at a group of bases, must meet % identity to get a dot

Definition

• Pairwise alignment:• The process of lining up two or more

sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.

Dot Plots

A Simple Way to Measure Similarity

Simple Dot Plot

G G C T T G A C C G G

G A A A A A

G A A A A A

A A

T A A

T A A

G A A A A A

A A

C A A A

C A A A

C A A A

G A A A A A

Dot plot filtered with 4 base window and 75%

identityG A T C A A C T G A C G T A

G T T C A G C T G C G T A C

Dot matrix provides visual picture of alignment

• It is used to easily spot segments of good sequence similarity.

• The two sequences are placed on each side of 2-dimensional matrix, and each cell in the matrix is then filled with a value for how well a short window of the sequences match at that point.

Simple Dot Plot


G A A A A A

G A A A A A

A A

T A A

T A A

G A A A A A

A A

C A A A

C A A A

C A A A

G A A A A A

A Limitation to Dot Matrix Comparison

• Where part of one sequence shares a long stretch of similarity with the other sequence, a diagonal of dots will be evident in the matrix.

• However, when single bases are compared at each position, most of the dots in the matrix will be due to background similarity.

• That is, for any two nucleotides compared between the two sequences, there is a 1 in 4 chance of a match, assuming equal frequencies of A,G,C and T.

A Solution

• This background noise can be filtered out by comparing groups of l nucleotides, rather than single nucleotides, at each position.

• For example, if we compare dinucleotides (l = 2), the probability of two dinucleotides chosen at random from each sequence matching is 1/16, rather than 1/4.

• Therefore, the number of background matches will be lower:

A Filtered Dot Plot


G A A

G A

A

T A

T A

G A

A A

C A

C A

C A

G

The Dot Matrix Algorithm

• The dot-matrix algorithm can be generalized for sequences s and t of sizes m and n, respectively, and window size l.

• For each position in sequence s, compare a window of l nucleotides centered at that position with each window of l nucleotides in sequence t.

• Conceptually, you can think of windows of length l sliding along each axis, so that all possible windows of l nucleotides are compared between the two sequences.

• The dot-matrix algorithm can be generalized for sequences s and t of sizes m and n, respectively, and window size l.

• For each position in sequence s, compare a window of l nucleotides centered at that position with each window of l nucleotides in sequence t.

• Conceptually, you can think of windows of length l sliding along each axis, so that all possible windows of l nucleotides are compared between the two sequences.

Dot Matrix Sequence Comparison Examples

Examples

• These examples comes from the webpage www.bioinformaticsonline.org

• This has a nice discussion of results in another package, DNA Strider.

• I used COMPARE program in SeqWeb.• I used the BLOSUM 62 scoring matrix.

http://www.bioinformaticsonline.org/

Comparing a Protein with Itself

• Proteins can be compared with themselves to show internal duplications or repeating sequences.

• A self-matrix produces a central diagonal line through the origin, indicating an exact match between the x and y axes.

• The parallel diagonals that appear off the central line are indicative of repeated sequence elements in different locations of the same protein.

Haptoglobin

• Haptoglobin is a protein that is secreted into the blood by the liver. This protein binds free hemoglobin.

• The concentration of "free" hemoglobin (that is, outside red blood cells) in plasma (the fluid portion of blood) is ordinarily very low.

• However, free hemoglobin is released when red blood cells hemolyze for any reason.

• After haptoglobin binds hemoglobin, it is taken up by the liver.

• The liver recycles the iron, heme, and amino acids contained in the hemoglobin protein.

Our Comparison

• Files used– 1006264A Haptoglobin H2

• DNA sequencing shows that the intragenic duplication within the human haptoglobin Hp2 allele was formed by a non-homologous, probably random, crossing-over within different introns of two Hp1 genes.

• A repeated sequence (starting with ADDGCP...) is observed beginning at positions 30-90 and 90-150 - probably due to a duplication event in one of these locations.

Window: 30 Stringency: 3Blosum 62 matrix

• One of the strengths of dot-matrix searches is that they make repeats easy to detect by comparing a sequence against itself.

• In self comparisons, direct repeats appear as diagonals parallel to the main line of identity.

Comparison of Two Similar Sequences

Our Comparison• Files Used:

– P03035 • Repressor protein from E. coli Phage p22

– RPBPL • Repressor protein from E. coli phage Lambda

• Lambda phages infect E. coli. They can be lytic and destroys the host cell, making hundreds of progeny.

• They can also be lysogenic, and live quietly within the DNA of the bacteria.

• A gene makes the repressor protein that prevents the phage from going destructively lytic.

• Phage p22 is a related phage that also makes a repressor.

• Both proteins form a dimer and bind DNA to prevent lysis.

Dot Matrix Sequence Comparison

• A row of dots represents a region of sequence similarity.

• Background matching also appears as scattered dots.

• There is a decrease in background noise as window and stringency parameters increase.

BLAST Sequence Alignment

• Perform a search of all sequences in a database for a match to a query sequence - BLAST search.– BLAST is an acronym for Basic Local

Alignment Search Tool.

• Search for patterns or domains in a sequence.

Disadvantages to Dot Plots

• While dot-matrix searches provide a great deal of information in a visual fashion, they can only be considered semi-quantitative, and therefore do not lend themselves to statistical analysis.

• Also, dot-matrix searches do not provide a precise alignment between two sequences.

Some Definitions for Sequence Alignments

Gaps and Insertions

• In an alignment, much better correspondence can be obtained between two sequences if a gap can be introduced in one sequence.

• Alternatively, an insertion could be allowed in the other sequence.

• Biologically, this corresponds to a mutation event that eliminates a part of a gene, or introduces new DNA into a gene.

Gaps

• Positions at which a letter is paired with a null are called gaps.

• Gap scores are typically negative. • Since a single mutational event may

cause the insertion or deletion of more than one residue, the presence of a gap is considered more significant than the length of the gap.

Optimal Alignment

• The alignment that is the best, given a defined set of rules and parameter values for comparing different alignments.

• There is no such thing as the single best alignment, since optimality always depends on the assumptions one bases the alignment on.

• For example, what penalty should gaps carry? • All sequence alignment procedures make

some such assumptions.

Global Alignment

• An alignment that assumes that the two strings are basically similar over the entire length of one another.

• The alignment attempts to match them to each other from end to end, even though parts of the alignment are not very convincing.

• A tiny example: LGPSTKDFGKISESREFDN | |||| | LNQLERSFGKINMRLEDA

Local Alignments

• An alignment that searches for segments of the two sequences that match well.

• There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. Using the same sequences as above, one could get:

----------FGKI---------- |||| ----------FGKI----------

Local Alignments

• It may seem that one should always use local alignments.

• However, it may be difficult to spot an overall similarity, as opposed to just a domain-to-domain similarity, if one uses only local alignment.

• So global alignment is useful in some cases. • The popular programs BLAST and FASTA for

searching sequence databases produce local alignments.

Are there other sequences like this one?

1) Huge public databases - GenBank, Swissprot, etc.

2) Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes

3) Similarity searching is based on alignment4) BLAST and FASTA provide rapid similarity

searchinga. rapid = approximate (heuristic)b. false + and - scores

Global vs. Local similarity1) Global similarity uses complete aligned

sequences - total % matches – GCG GAP program, Needleman & Wunsch

algorithm

2) Local similarity looks for best internal matching region between 2 sequences – GCG BESTFIT program, – Smith-Waterman algorithm, – BLAST and FASTA

3) dynamic programming – optimal computer solution, not approximate

What Program to use When?

1) BLAST is fastest and easily accessed on the Web– limited sets of databases– nice translation tools (BLASTX, TBLASTN)

2) FASTA works best in GCG– integrated with GCG– precise choice of databases– more sensitive for DNA-DNA comparisons– FASTX and TFASTX can find similarities in sequences with

frameshifts

3) Smith-Waterman is slower, but more sensitive – known as a “rigorous” or “exhaustive” search– SSEARCH in GCG and standalone FASTA

Sequence Alignments

• Sometimes only parts of sequences match e.g. domain (longer) or motif (shorter) of a protein or a regulatory pattern in DNA

• Poor alignments can be misleading – you have to learn to recognize and test the significance of an alignment

Comparing the protein kinase KRAF_HUMAN and the uncharacterized O22558 from

Arabidopsis using BLAST

546 AA Score = 185 bits (464), Expect = 1e-45 Identities = 107/283 (37%), Positives = 172/283 (59%), Gaps = 15/283 (5%)

Query: 337 DSSYYWEIEASEVMLSTRIGSGSFGTVYKGKWHG-DVAVKILKVVDPTPEQFQAFRNEVA 395 D + WEI+ +++ + ++ SGS+G +++G + +VA+K LK E + F EV Sbjct: 274 DGTDEWEIDVTQLKIEKKVASGSYGDLHRGTYCSQEVAIKFLKPDRVNNEMLREFSQEVF 333

Query: 396 VLRKTRHVNILLFMGYMTKD-NLAIVTQWCEGSSLYKHLHVQETKFQMFQLIDIARQTAQ 454 ++RK RH N++ F+G T+ L IVT++ S+Y LH Q+ F++ L+ +A A+ Sbjct: 334 IMRKVRHKNVVQFLGACTRSPTLCIVTEFMARGSIYDFLHKQKCAFKLQTLLKVALDVAK 393

Query: 455 GMDYLHAKNIIHRDMKSNNIFLHEGLTVKIGDFGLATVKSRWSGSQQVEQPTGSVLWMAP 514 GM YLH NIIHRD+K+ N+ + E VK+ DFG+A V+ SG E TG+ WMAP Sbjct: 394 GMSYLHQNNIIHRDLKTANLLMDEHGLVKVADFGVARVQIE-SGVMTAE--TGTYRWMAP 450

Query: 515 EVIRMQDNNPFSFQSDVYSYGIVLYELMTGELPYSHINNRDQIIFMVGRGYASPDLSKLY 574 EVI ++ P++ ++DV+SY IVL+EL+TG++PY+ + + +V +G P + K Sbjct: 451 EVI---EHKPYNHKADVFSYAIVLWELLTGDIPYAFLTPLQAAVGVVQKG-LRPKIPK-- 504

Query: 575 KNCPKAMKRLVADCVKKVKEERPLFPQILSSIELLQHSLPKIN 617 K PK +K L+ C + E+RPLF +I IE+LQ + ++N Sbjct: 505 KTHPK-VKGLLERCWHQDPEQRPLFEEI---IEMLQQIMKEVN 543

pairwise sequence alignments

Documents

algorithman algorithm

sequence comparison

unique sequence

finite strings

complete cdna sequence

linear sequence of characters

design of data structures

different data structures