pairwise sequence alignments

Pairwise Sequence Alignments Bioinformatics

Upload: odessa

Post on 11-Jan-2016


TRANSCRIPT

Page 1: Pairwise Sequence Alignments

Pairwise Sequence Alignments

Bioinformatics

Page 2: Pairwise Sequence Alignments

Some Bioinformatics Programming Terminology

Page 3: Pairwise Sequence Alignments

Model

• A model is a set of propositions or equations describing in simplified form some aspects of experience.

• A valid model includes all the essential elements of the concept or system it describes, together with their interactions.

Page 4: Pairwise Sequence Alignments

Algorithm

• An algorithm is a complete, unambiguous procedure for solving a specified problem in a finite number of steps.

• Algorithms leave nothing undefined and require no intuition to achieve their end.

Page 5: Pairwise Sequence Alignments

Five Features of an Algorithm:

• An algorithm must stop after a finite number of steps.

• All steps of the algorithm must be precisely defined.

• Input to the algorithm must be specified.

• Output of the algorithm must be specified. There must be at least one output.

• An algorithm must be effective - i.e. its operations must be basic and doable.

Page 6: Pairwise Sequence Alignments

Data Structures: Foundation of an Algorithm

• Choosing the data structure is one of the most important decisions in writing a program.

• For the same operation, different data structures can lead to vastly more or less efficient algorithms.

• The design of data structures and algorithms goes hand in hand.

• Once the data structure is well defined, usually the algorithm can be simple.

Page 7: Pairwise Sequence Alignments

Data Structure Primitives:

• Strings

• Arrays

Page 8: Pairwise Sequence Alignments

A String Is a Linear Sequence of Characters.

• This implies several important properties:

• Finite strings have beginnings and ends. Thus they also have a length.

• Strings imply an alphabet.

• The elements of a string are ordered.

• A string is a one-dimensional array.

Page 9: Pairwise Sequence Alignments

Two Dimensional Arrays
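
As a minimal sketch of the two primitives (the variable names are illustrative and not from the slides), the Python fragment below shows the string properties listed earlier and builds a two-dimensional array whose rows follow one string and whose columns follow another. It assumes plain built-in types only.

```python
# Two data-structure primitives used throughout sequence alignment.
seq_s = "GGATTGACCCG"   # a string: finite, ordered, drawn from an alphabet
seq_t = "GGCTTGACCGG"

length = len(seq_s)                 # a finite string has a length
alphabet = sorted(set(seq_s))       # the alphabet implied by this string
first, last = seq_s[0], seq_s[-1]   # it has a beginning and an end

# A two-dimensional array: one row per character of seq_s, one column per
# character of seq_t. Cell [i][j] records whether seq_s[i] equals seq_t[j].
grid = [[seq_s[i] == seq_t[j] for j in range(len(seq_t))]
        for i in range(len(seq_s))]

print(length, alphabet, first, last)
print(len(grid), "rows x", len(grid[0]), "columns")
```

A list of lists is only one possible representation; it is chosen here because it needs no external library.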

Page 10: Pairwise Sequence Alignments

Pairwise Sequence Alignment is Fundamental to Bioinformatics

• It is used to decide if two proteins (or genes) are related structurally or functionally

• Two Dimensional Arrays are the basis of Pairwise Alignments

• It is used to identify domains or motifs that are shared between proteins

• It is the basis of BLAST searching

• It is used in the analysis of genomes

Page 11: Pairwise Sequence Alignments

Are there other sequences like this one?

• Huge public databases - GenBank, Swissprot, etc.

• Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes

• Similarity searching is based on alignment between two strings in a 2-D array

Page 12: Pairwise Sequence Alignments

Why Search for Similarity?

1. I have just sequenced something. What is known about the thing I sequenced?

2. I have a unique sequence. Is there similarity to another gene that has a known function?

3. I found a new protein in a lower organism. Is it similar to a protein from another species?

4. I have decided to work on a new gene. The people in the field will not give me the plasmid. I need the complete cDNA sequence to perform RT-PCR or some other experiment.

Page 13: Pairwise Sequence Alignments

Definitions

• Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.

• Identity: The extent to which two sequences are invariant.

• Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.

RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
               + K++ +  + +GTW++MA      + L   + A   V  T  +         +L+ W+
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
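
The identity part of these definitions can be counted directly from the two aligned rows. The sketch below is illustrative only: the function name is hypothetical, and it assumes that percent identity is taken over all alignment columns, gaps included (other conventions divide by the shorter sequence or by the ungapped columns). Measuring conservation would also need a substitution matrix, which is not shown here.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (identical columns, total columns, percent identity).

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for a, b in zip(aligned_a, aligned_b)
                    if a == b and a != gap)
    columns = len(aligned_a)
    return identical, columns, 100.0 * identical / columns


# The two aligned rows from the RBP / glycodelin example above.
rbp = "RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD-"
glycodelin = "QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN"

identical, columns, pct = percent_identity(rbp, glycodelin)
print(f"{identical}/{columns} identical columns ({pct:.0f}%)")
```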

Page 14: Pairwise Sequence Alignments

Definitions

• Identical - When a corresponding character is shared between two species or populations, that character is said to be identical.

• Similar - The degree to which two species or populations share identities.

• Homologous - When characters are similar due to common ancestry, they are homologous.

Page 15: Pairwise Sequence Alignments

Evolution and Alignment

• Homology - two (or more) sequences have a common ancestor.

• This is a statement about evolutionary history.

• Similarity - two sequences are similar, by some criterion.

• It does not refer to any historical process, just to a comparison of the sequences by some method.

• It is a logically weaker statement.

Page 16: Pairwise Sequence Alignments

Caution

• In bioinformatics these two terms are often confused and used interchangeably.

• The reason is probably that significant similarity is such a strong argument for homology.

Page 17: Pairwise Sequence Alignments

Similarity ≠ Homology

1) 25% similarity over a stretch of 100 or more amino acids (AAs) is strong evidence for homology

2) Since homology is an evolutionary statement, there should be additional evidence which indicates “descent from a common ancestor”

– common 3D structure

– usually common function

3) Homology is all or nothing. You cannot say "50% homologous"

Page 18: Pairwise Sequence Alignments

retinol-binding protein(NP_006735)

b-lactoglobulin(P02754)

Page 42

Page 19: Pairwise Sequence Alignments

Definitions, Cont.

• Analogous - Characters are similar due to convergent evolution.

• Orthologous - Homologous sequences (or characters) between different species that descended from a common ancestral gene during speciation. They may or may not be responsible for a similar function.

• Paralogous - Homologous sequences within a single species that arose by gene duplication.

• Homology is therefore NOT synonymous with similarity.

• Homology is a judgment, similarity is a measurement.

Page 20: Pairwise Sequence Alignments
Page 21: Pairwise Sequence Alignments
Page 22: Pairwise Sequence Alignments

Proteins or Genes Related by Evolution Share a Common Ancestor

• Random mutations in the sequences accumulate over time, so that proteins or genes that have a common ancestor far back in time are not as similar as proteins or genes that diverged from each other more recently.

• Analysis of evolutionary relationships between protein or gene sequences depends critically on sequence alignments.

Page 23: Pairwise Sequence Alignments

Function is Conserved

• Alignments can reveal which parts of the sequences are likely to be important for the function, if the proteins are involved in similar processes.

• In parts of the sequence of a protein which are not very critical for its function, random mutations can easily accumulate.

• In parts of the sequence that are critical for the function of the protein, hardly any mutations will be accepted; nearly all changes in such regions will destroy the function.

Page 24: Pairwise Sequence Alignments

Sequence Alignments

• Comparing sequences provides information as to which genes have the same function

• Sequences are compared by aligning them – sliding them along each other to find the most matches with a few gaps

• An alignment can be scored – count matches, and can penalize mismatches and gaps

• It is much easier to align proteins. Why?
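
The scoring idea in the list above (count matches, penalize mismatches and gaps) can be sketched as a single pass over two already-aligned strings. The scoring values below (+1, -1, -2) are arbitrary assumptions chosen for illustration, not values given in the slides.

```python
def score_alignment(aligned_a, aligned_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned strings column by column.

    Assumption: '-' marks a gap, and every gap column costs the same.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            score += gap        # a letter paired with a null
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # different residues
    return score


print(score_alignment("GATTACA", "GA-TACA"))  # six matches and one gap
print(score_alignment("GATTACA", "GACTACA"))  # six matches and one mismatch
```

For protein sequences the fixed match and mismatch values would normally be replaced by a lookup in a substitution matrix such as PAM or BLOSUM.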

Page 25: Pairwise Sequence Alignments

Why Search with Protein, not DNA Sequences?

1) 4 DNA bases vs. 20 amino acids - matches are less likely to arise by chance in protein comparisons

2) can have varying degrees of similarity between different AAs

– # of mutations, chemical similarity, PAM matrix

3) protein databanks are much smaller than DNA databanks

Page 26: Pairwise Sequence Alignments
Page 27: Pairwise Sequence Alignments

Similarity is Based on Dot Plots

1) two sequences on vertical and horizontal axes of graph

2) put dots wherever there is a match

3) diagonal line is region of identity (local alignment)

4) apply a window filter - look at a group of bases, must meet % identity to get a dot

Page 28: Pairwise Sequence Alignments

Definition

• Pairwise alignment: The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.

Page 29: Pairwise Sequence Alignments

Dot Plots

A Simple Way to Measure Similarity

Page 30: Pairwise Sequence Alignments

Simple Dot Plot

  G G C T T G A C C G G
G * *       *       * *
G * *       *       * *
A             *
T       * *
T       * *
G * *       *       * *
A             *
C     *         * *
C     *         * *
C     *         * *
G * *       *       * *

(* marks a match between the row base and the column base)
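
A grid like the one above can be regenerated with a few lines of Python. This is a minimal sketch, assuming one dot per identical pair of single bases and using `*` as the dot character; the function name is illustrative.

```python
def dot_plot(s, t):
    """Return text rows: a header for t, then one row per character of s.

    A '*' is placed wherever s[i] == t[j]; all other cells are blank.
    """
    rows = ["  " + " ".join(t)]
    for a in s:
        cells = ["*" if a == b else " " for b in t]
        rows.append((a + " " + " ".join(cells)).rstrip())
    return rows


rows = dot_plot("GGATTGACCCG", "GGCTTGACCGG")
print("\n".join(rows))
print(sum(r.count("*") for r in rows[1:]), "dots in total")
```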

Page 31: Pairwise Sequence Alignments

Dot plot filtered with 4 base window and 75% identity

G A T C A A C T G A C G T A

G T T C A G C T G C G T A C

Page 32: Pairwise Sequence Alignments

Dot matrix provides visual picture of alignment

• It is used to easily spot segments of good sequence similarity.

• The two sequences are placed along the two axes of a 2-dimensional matrix, and each cell in the matrix is then filled with a value for how well a short window of the sequences matches at that point.

Page 33: Pairwise Sequence Alignments

Simple Dot Plot

  G G C T T G A C C G G
G * *       *       * *
G * *       *       * *
A             *
T       * *
T       * *
G * *       *       * *
A             *
C     *         * *
C     *         * *
C     *         * *
G * *       *       * *

(* marks a match between the row base and the column base)

Page 34: Pairwise Sequence Alignments

A Limitation to Dot Matrix Comparison

• Where part of one sequence shares a long stretch of similarity with the other sequence, a diagonal of dots will be evident in the matrix.

• However, when single bases are compared at each position, most of the dots in the matrix will be due to background similarity.

• That is, for any two nucleotides compared between the two sequences, there is a 1 in 4 chance of a match, assuming equal frequencies of A,G,C and T.

Page 35: Pairwise Sequence Alignments

A Solution

• This background noise can be filtered out by comparing groups of l nucleotides, rather than single nucleotides, at each position.

• For example, if we compare dinucleotides (l = 2), the probability of two dinucleotides chosen at random from each sequence matching is 1/16, rather than 1/4.

• Therefore, the number of background matches will be lower:

Page 36: Pairwise Sequence Alignments

A Filtered Dot Plot

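  G G C T T G A C C G G
G *                 *
G           *
A
T       *
T         *
G           *
A             *
C               *
C               *
C                 *
G

(* marks a position where the dinucleotide starting in that row matches the dinucleotide starting in that column)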
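
The 1/4 versus 1/16 argument can be checked numerically. The sketch below assumes random sequences with equal, independent base frequencies (as stated on the earlier slide) and windows that must match exactly; the function names are illustrative. Real sequences, including the related pair plotted above, will not follow these expectations exactly.

```python
from fractions import Fraction


def chance_match_probability(window, alphabet_size=4):
    """Probability that two random windows of this length are identical."""
    return Fraction(1, alphabet_size) ** window


def expected_background_dots(m, n, window, alphabet_size=4):
    """Expected number of chance dots for sequences of lengths m and n."""
    comparisons = (m - window + 1) * (n - window + 1)
    return comparisons * chance_match_probability(window, alphabet_size)


for l in (1, 2, 3, 4):
    p = chance_match_probability(l)
    dots = expected_background_dots(11, 11, l)
    print(f"l={l}: P(match)={p}, expected background dots ~ {float(dots):.1f}")
```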

Page 37: Pairwise Sequence Alignments

The Dot Matrix Algorithm

• The dot-matrix algorithm can be generalized for sequences s and t of sizes m and n, respectively, and window size l.

• For each position in sequence s, compare a window of l nucleotides centered at that position with each window of l nucleotides in sequence t.

• Conceptually, you can think of windows of length l sliding along each axis, so that all possible windows of l nucleotides are compared between the two sequences.
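
The generalized procedure can be sketched as two nested loops over window start positions. Two simplifications are assumed here and are not from the slides: windows are anchored at their first base rather than centered, and a hypothetical `min_matches` parameter plays the role of the stringency (a 4-base window at 75% identity corresponds to `window=4, min_matches=3`).

```python
def windowed_dot_matrix(s, t, window=1, min_matches=None):
    """Return the set of (i, j) positions that receive a dot.

    A dot is placed at (i, j) when the windows starting at s[i] and t[j]
    agree in at least min_matches of their `window` positions.
    """
    if min_matches is None:
        min_matches = window  # default: the two windows must be identical
    dots = set()
    for i in range(len(s) - window + 1):
        for j in range(len(t) - window + 1):
            matches = sum(1 for k in range(window) if s[i + k] == t[j + k])
            if matches >= min_matches:
                dots.add((i, j))
    return dots


s = "GGATTGACCCG"   # vertical sequence of the example dot plots
t = "GGCTTGACCGG"   # horizontal sequence
unfiltered = windowed_dot_matrix(s, t, window=1)
filtered = windowed_dot_matrix(s, t, window=2)
print(len(unfiltered), "dots with single bases;", len(filtered), "with l = 2")

# 4-base window at 75% identity: at least 3 of 4 positions must agree.
relaxed = windowed_dot_matrix("GATCAACTGACGTA", "GTTCAGCTGCGTAC",
                              window=4, min_matches=3)
print(len(relaxed), "dots with a 4-base window and 75% identity")
```

The running time of this direct version grows with m × n × l, which is acceptable for short sequences but is one reason database search tools rely on faster heuristics.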

Page 38: Pairwise Sequence Alignments


Page 39: Pairwise Sequence Alignments

Dot Matrix Sequence Comparison Examples

Page 40: Pairwise Sequence Alignments

Examples

• These examples come from the webpage www.bioinformaticsonline.org

• This has a nice discussion of results in another package, DNA Strider.

• I used the COMPARE program in SeqWeb.

• I used the BLOSUM 62 scoring matrix.

Page 41: Pairwise Sequence Alignments

Comparing a Protein with Itself

• Proteins can be compared with themselves to show internal duplications or repeating sequences.

• A self-matrix produces a central diagonal line through the origin, indicating an exact match between the x and y axes.

• The parallel diagonals that appear off the central line are indicative of repeated sequence elements in different locations of the same protein.

Page 42: Pairwise Sequence Alignments

Haptoglobin

• Haptoglobin is a protein that is secreted into the blood by the liver. This protein binds free hemoglobin.

• The concentration of "free" hemoglobin (that is, outside red blood cells) in plasma (the fluid portion of blood) is ordinarily very low.

• However, free hemoglobin is released when red blood cells hemolyze for any reason.

• After haptoglobin binds hemoglobin, it is taken up by the liver.

• The liver recycles the iron, heme, and amino acids contained in the hemoglobin protein.

Page 43: Pairwise Sequence Alignments

Our Comparison

• Files used

– 1006264A Haptoglobin H2

• DNA sequencing shows that the intragenic duplication within the human haptoglobin Hp2 allele was formed by a non-homologous, probably random, crossing-over within different introns of two Hp1 genes.

• A repeated sequence (starting with ADDGCP...) is observed beginning at positions 30-90 and 90-150 - probably due to a duplication event in one of these locations.

Page 44: Pairwise Sequence Alignments

Window: 30, Stringency: 3, BLOSUM 62 matrix

Page 45: Pairwise Sequence Alignments

• One of the strengths of dot-matrix searches is that they make repeats easy to detect by comparing a sequence against itself.

• In self comparisons, direct repeats appear as diagonals parallel to the main line of identity.
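
The off-diagonal lines of a self comparison can also be found without drawing the plot. The sketch below is an assumption-laden illustration: it requires exact window matches (no scoring matrix or stringency), the function name is hypothetical, and the toy sequence is made up for the example rather than taken from haptoglobin.

```python
def repeat_offsets(seq, window):
    """Map each diagonal offset d > 0 to the start positions i where
    seq[i:i+window] equals seq[i+d:i+d+window].

    Each offset corresponds to a diagonal parallel to the main line of
    identity in a self dot plot.
    """
    offsets = {}
    last_start = len(seq) - window
    for i in range(last_start + 1):
        for j in range(i + 1, last_start + 1):
            if seq[i:i + window] == seq[j:j + window]:
                offsets.setdefault(j - i, []).append(i)
    return offsets


# Made-up toy sequence containing the same six-residue unit twice.
repeat_unit = "ADDGCP"
toy = "MK" + repeat_unit + "TVS" + repeat_unit + "LW"
print(repeat_offsets(toy, window=4))
```

Several consecutive start positions sharing one offset indicate a direct repeat longer than the window.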

Page 46: Pairwise Sequence Alignments

Comparison of Two Similar Sequences

Page 47: Pairwise Sequence Alignments

Our Comparison

• Files Used:

– P03035: Repressor protein from Salmonella phage P22

– RPBPL: Repressor protein from E. coli phage Lambda

• Lambda phages infect E. coli. They can be lytic and destroy the host cell, making hundreds of progeny.

• They can also be lysogenic, and live quietly within the DNA of the bacteria.

• A gene makes the repressor protein that prevents the phage from going destructively lytic.

• Phage P22 is a related phage that also makes a repressor.

• Both proteins form a dimer and bind DNA to prevent lysis.

Page 48: Pairwise Sequence Alignments
Page 49: Pairwise Sequence Alignments

Dot Matrix Sequence Comparison

• A row of dots represents a region of sequence similarity.

• Background matching also appears as scattered dots.

• There is a decrease in background noise as window and stringency parameters increase.

Page 50: Pairwise Sequence Alignments

Window: 10, Stringency: 1, BLOSUM 62 matrix

Page 51: Pairwise Sequence Alignments

Window: 10, Stringency: 3, BLOSUM 62 matrix

Page 52: Pairwise Sequence Alignments

Window: 30, Stringency: 1, BLOSUM 62 matrix

Page 53: Pairwise Sequence Alignments

Window: 30, Stringency: 3, BLOSUM 62 matrix

Page 54: Pairwise Sequence Alignments

BLAST Sequence Alignment

• Perform a search of all sequences in a database for a match to a query sequence - BLAST search.

– BLAST is an acronym for Basic Local Alignment Search Tool.

• Search for patterns or domains in a sequence.

Page 55: Pairwise Sequence Alignments

Disadvantages to Dot Plots

• While dot-matrix searches provide a great deal of information in a visual fashion, they can only be considered semi-quantitative, and therefore do not lend themselves to statistical analysis.

• Also, dot-matrix searches do not provide a precise alignment between two sequences.

Page 56: Pairwise Sequence Alignments

Some Definitions for Sequence Alignments

Page 57: Pairwise Sequence Alignments

Gaps and Insertions

• In an alignment, much better correspondence can be obtained between two sequences if a gap can be introduced in one sequence.

• Alternatively, an insertion could be allowed in the other sequence.

• Biologically, this corresponds to a mutation event that eliminates a part of a gene, or introduces new DNA into a gene.

Page 58: Pairwise Sequence Alignments

Gaps

• Positions at which a letter is paired with a null are called gaps.

• Gap scores are typically negative.

• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is considered more significant than the length of the gap.
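
One common way to make the presence of a gap count for more than its length is an affine gap penalty: a large cost for opening a gap and a small cost for each additional position. The sketch below is illustrative only. The values -10 and -1 are arbitrary assumptions, and programs differ on whether the first gap position is charged the opening cost alone (as here) or opening plus extension.

```python
def affine_gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Penalty for a single gap spanning `length` positions.

    Assumption: the first position costs gap_open and every further
    position costs gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_penalty(3)          # one event, three residues
three_short_gaps = 3 * affine_gap_penalty(1)  # three separate events
print(one_long_gap, three_short_gaps)
```

Under these assumed values a single three-residue gap is penalized far less than three separate one-residue gaps, which matches the single-mutational-event reasoning above.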

Page 59: Pairwise Sequence Alignments

Optimal Alignment

• The alignment that is the best, given a defined set of rules and parameter values for comparing different alignments.

• There is no such thing as the single best alignment, since optimality always depends on the assumptions one bases the alignment on.

• For example, what penalty should gaps carry?

• All sequence alignment procedures make some such assumptions.

Page 60: Pairwise Sequence Alignments

Global Alignment

• An alignment that assumes that the two strings are basically similar over the entire length of one another.

• The alignment attempts to match them to each other from end to end, even though parts of the alignment are not very convincing.

• A tiny example:

LGPSTKDFGKISESREFDN
|      ||||    |
LNQLERSFGKINMRLEDA
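
A global alignment of this kind can be computed with dynamic programming over a two-dimensional array. The sketch below follows the Needleman-Wunsch scheme under simplifying assumptions that are not from the slides: fixed match and mismatch scores instead of a substitution matrix, and a linear gap penalty. With other parameters the resulting alignment can differ from the hand-drawn example above.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a aligned to gaps only
        score[i][0] = i * gap
    for j in range(1, cols):          # first row: b aligned to gaps only
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # pair a[i-1], b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("LGPSTKDFGKISESREFDN", "LNQLERSFGKINMRLEDA")
print(total)
print(top)
print(bottom)
```

Because the whole of both sequences must be placed in the alignment, poorly matching regions are still aligned and simply lower the total score.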

Page 61: Pairwise Sequence Alignments

Local Alignments

• An alignment that searches for segments of the two sequences that match well.

• There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. Using the same sequences as above, one could get:

----------FGKI----------
          ||||
----------FGKI----------
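
A local alignment can be sketched with the Smith-Waterman variant of the same dynamic programming idea: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell instead of the corner. The scoring values below (+2, -1, -2) are assumptions chosen for illustration, and the first test pair is made up so that only one segment can match.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty.

    Returns (best score, aligned segment of a, aligned segment of b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new segment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero-scoring cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Made-up pair in which only the FGKI segment is shared.
print(smith_waterman("XXXFGKIYYY", "ZZZZFGKIWW"))
# The two sequences from the slides; the result depends on the assumed scores.
print(smith_waterman("LGPSTKDFGKISESREFDN", "LNQLERSFGKINMRLEDA"))
```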

Page 62: Pairwise Sequence Alignments

Local Alignments

• It may seem that one should always use local alignments.

• However, it may be difficult to spot an overall similarity, as opposed to just a domain-to-domain similarity, if one uses only local alignment.

• So global alignment is useful in some cases.

• The popular programs BLAST and FASTA for searching sequence databases produce local alignments.

Page 63: Pairwise Sequence Alignments

Are there other sequences like this one?

1) Huge public databases - GenBank, Swissprot, etc.

2) Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes

3) Similarity searching is based on alignment

4) BLAST and FASTA provide rapid similarity searching

a. rapid = approximate (heuristic)

b. false positive and false negative scores

Page 64: Pairwise Sequence Alignments

Global vs. Local similarity

1) Global similarity uses complete aligned sequences - total % matches

– GCG GAP program, Needleman & Wunsch algorithm

2) Local similarity looks for best internal matching region between 2 sequences

– GCG BESTFIT program

– Smith-Waterman algorithm

– BLAST and FASTA

3) dynamic programming – optimal computer solution, not approximate

Page 65: Pairwise Sequence Alignments

What Program to use When?

1) BLAST is fastest and easily accessed on the Web

– limited sets of databases

– nice translation tools (BLASTX, TBLASTN)

2) FASTA works best in GCG

– integrated with GCG

– precise choice of databases

– more sensitive for DNA-DNA comparisons

– FASTX and TFASTX can find similarities in sequences with frameshifts

3) Smith-Waterman is slower, but more sensitive

– known as a “rigorous” or “exhaustive” search

– SSEARCH in GCG and standalone FASTA

Page 66: Pairwise Sequence Alignments

Sequence Alignments

• Sometimes only parts of sequences match e.g. domain (longer) or motif (shorter) of a protein or a regulatory pattern in DNA

• Poor alignments can be misleading – you have to learn to recognize and test the significance of an alignment

Page 67: Pairwise Sequence Alignments

Comparing the protein kinase KRAF_HUMAN and the uncharacterized O22558 from Arabidopsis using BLAST

546 AA
Score = 185 bits (464), Expect = 1e-45
Identities = 107/283 (37%), Positives = 172/283 (59%), Gaps = 15/283 (5%)

Query: 337 DSSYYWEIEASEVMLSTRIGSGSFGTVYKGKWHG-DVAVKILKVVDPTPEQFQAFRNEVA 395
           D +  WEI+ +++ +  ++ SGS+G +++G +   +VA+K LK      E  + F  EV
Sbjct: 274 DGTDEWEIDVTQLKIEKKVASGSYGDLHRGTYCSQEVAIKFLKPDRVNNEMLREFSQEVF 333

Query: 396 VLRKTRHVNILLFMGYMTKD-NLAIVTQWCEGSSLYKHLHVQETKFQMFQLIDIARQTAQ 454
           ++RK RH N++ F+G  T+   L IVT++    S+Y  LH Q+  F++  L+ +A   A+
Sbjct: 334 IMRKVRHKNVVQFLGACTRSPTLCIVTEFMARGSIYDFLHKQKCAFKLQTLLKVALDVAK 393

Query: 455 GMDYLHAKNIIHRDMKSNNIFLHEGLTVKIGDFGLATVKSRWSGSQQVEQPTGSVLWMAP 514
           GM YLH  NIIHRD+K+ N+ + E   VK+ DFG+A V+   SG    E  TG+  WMAP
Sbjct: 394 GMSYLHQNNIIHRDLKTANLLMDEHGLVKVADFGVARVQIE-SGVMTAE--TGTYRWMAP 450

Query: 515 EVIRMQDNNPFSFQSDVYSYGIVLYELMTGELPYSHINNRDQIIFMVGRGYASPDLSKLY 574
           EVI   ++ P++ ++DV+SY IVL+EL+TG++PY+ +      + +V +G   P + K
Sbjct: 451 EVI---EHKPYNHKADVFSYAIVLWELLTGDIPYAFLTPLQAAVGVVQKG-LRPKIPK-- 504

Query: 575 KNCPKAMKRLVADCVKKVKEERPLFPQILSSIELLQHSLPKIN 617
           K  PK +K L+  C  +  E+RPLF +I   IE+LQ  + ++N
Sbjct: 505 KTHPK-VKGLLERCWHQDPEQRPLFEEI---IEMLQQIMKEVN 543