# pairwise sequence alignment

Post on 22-Jan-2016

DESCRIPTION

Pairwise sequence alignment. Based on a presentation by Irit Gat-Viks, which is based on presentations by Amir Mitchel (Introduction to Bioinformatics course, Bioinformatics Unit, Tel Aviv University) and Benny Shomer (Bar-Ilan University).

Where are we in the course?
Ways to interrogate biobanks:
- By identifier-based search (GenBank etc.)
- By genome location (genome browsers)
- By mining annotation files with scripts
- Now: searching by sequence similarity

What is it good for?
- Function inference: if we know something about A and A is similar to B, we can say something about B ("guilt by association")
- Conservation arguments: if we know that A and B do something similar, by looking at the conserved segments we can infer which parts of A and B are important for their function
- Looking for repeats etc.
- Identifying the position of an mRNA (or any transcript) in the genome
- Resequencing
- Etc.

Issues with sequence similarity
Things we're after:
- A score: how well do two sequences fit?
- Statistics: is this score significant or expected at random?
- Regions: which parts of the query and the target sequence are actually similar/different?
Next time:
- How to efficiently search a large sequence database

Topics to be Covered
- Introduction
- Comparison methods: global/local alignment
- Alignment parameters
- Alignment scoring matrices: proteins
- Alignment scoring matrices: DNA
- Evaluation
- Comparison programs

Start from simple: Dot plots
- The most intuitive method to compare two sequences.
- Each dot represents an identity of two characters.
- No real score/significance, but very easy to assess visually.

To Reduce Random Noise in Dot Matrix
- Specify a window size, w
- Take w residues from each of the two sequences
- Among the w pairs of residues, count how many pairs are matches
- Specify a stringency: the minimum number of matches within the window required to draw a dot

Simple Dot Matrix, Window Size 1
[Figure: dot matrix of PVILEPMMKVTIEMP (columns) against PVILEPIMRVEVTTP (rows); every identical residue pair is marked with 1]

Window Size is 3
[Figure: the same matrix with window size 3; each cell shows the number of matches in the window, from 1 to 3]

Window Size is 3; Stringency is 2
[Figure: the same matrix with window size 3, showing only cells with at least 2 matches]
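The window and stringency filter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `dot_matrix` and `render` are made up here, the two sequences are read off the figure residue above, and the window is assumed to start at cell (i, j) and run down the diagonal (the slides do not say whether the window is anchored or centered, so the numbers need not line up cell by cell with the original figures).

```python
def dot_matrix(row_seq, col_seq, window=1, stringency=1):
    """Count matches in each diagonal window; keep counts >= stringency.

    Assumption: the window starts at cell (i, j) and runs down the diagonal,
    so the grid has len(seq) - window + 1 rows and columns.
    """
    n_rows = len(row_seq) - window + 1
    n_cols = len(col_seq) - window + 1
    grid = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            # Number of identical residue pairs among the `window` pairs.
            matches = sum(row_seq[i + k] == col_seq[j + k] for k in range(window))
            row.append(matches if matches >= stringency else 0)
        grid.append(row)
    return grid


def render(grid, row_seq, col_seq):
    """Draw the grid as text, one line per row residue, '.' for empty cells."""
    lines = ["  " + col_seq[: len(grid[0])]]
    for i, row in enumerate(grid):
        cells = "".join(str(v) if v else "." for v in row)
        lines.append(row_seq[i] + " " + cells)
    return "\n".join(lines)


ROWS = "PVILEPIMRVEVTTP"  # row sequence, as read from the slide
COLS = "PVILEPMMKVTIEMP"  # column sequence, as read from the slide
print(render(dot_matrix(ROWS, COLS, window=3, stringency=2), ROWS, COLS))
```

Raising `stringency` toward `window` removes more of the isolated random matches and leaves mostly the diagonal runs.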

Protein Sequences
- Single residue identity
- 6 out of 23 identical

Insertion/Deletion, Inversion

Sequence with a tandem duplication: ABCDEFGEFGHIJKLMNO
- Tandem duplication compared to no duplication
- Tandem duplication compared to self

What Is This?
5' GGCGG 3'
Palindrome (Intrastrand)

Compare a sequence with itself
- Identifies low complexity/repeat regions

Dotlet example
- http://myhits.isb-sib.ch/cgi-bin/dotlet
- http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=672
- http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=353120

Definition
Alignment: a matching of two sequences. A good alignment will match many identical (similar) characters in the two sequences.

    VLSPAD-TNVK-AWAKVGAHAAGHG
    ||| |    |   |||| |  ||||
    VLSEAEWQ-VLHVWAKVEA--AGHG

How similar are two sequences?
- The common measure of sequence similarity is their alignment score.
- Simpler measures, e.g., % identity, are also common.
- These require algorithms that compute the optimal alignment between sequences.

How to present the alignment?
| character-wise identity
: very similar amino acids
. less similar amino acids
- gap in one of the sequences

Pairwise Alignment - Scoring
The final score of the alignment is the sum of the positive scores and penalty scores:

+ Number of Identities
+ Number of Similarities
- Number of Dissimilarities
- Number of Gap openings
- Number of Gap extensions
= Alignment score
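As a rough illustration of this sum, the sketch below scores an alignment that is already given as two gapped strings. It is a simplification, not the slides' exact scheme: similarities and dissimilarities are collapsed into a single mismatch score (a real scorer would look them up in a substitution matrix), the function name `score_alignment` and all numeric weights are hypothetical, and a gap of length L is assumed to cost one opening plus L-1 extensions.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Sum per-column scores of a gapped pairwise alignment.

    Assumed convention: the first column of a gap pays `gap_open`,
    every further column of the same gap pays `gap_extend`.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    prev_gap = None  # which row held the gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            score += gap_extend if gap_row == prev_gap else gap_open
            prev_gap = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score


# 9 identities, 5 mismatches, 1 gap opening under the weights above.
print(score_alignment("ADLGAVFALCDRYFQ", "ADLGRTQN-CDRYYQ"))
```

Changing the weights changes which alignment scores best, which is why the choice of scoring system matters so much in the sections that follow.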

Comparison methods
- Global alignment: finds the best alignment across the whole two sequences.
- Local alignment: finds regions of similarity in parts of the sequences.
[Schematic: global vs. local alignment]

Global Alignment
Algorithm of Needleman and Wunsch (1970).
Finds the alignment of two complete sequences:

    ADLGAVFALCDRYFQ
    ||||     |||| |
    ADLGRTQN-CDRYYQ

Semi-global alignment allows "free ends":

    GFHKKKADLGAVFALCDRYFQ
          ||||     |||| |
          ADLGRTQN-CDRYYQJKLLKJ

Local Alignment
Algorithm of Smith and Waterman (1981).
Makes an optimal alignment of the best segment of similarity between two sequences.

    ADLG CDRYFQ
    |||| |||| |
    ADLG CDRYYQ

Can return a number of well aligned segments.

Finding an optimal alignment
- Pairwise alignment algorithms identify the highest scoring alignment from all possible alignments.
- Different scoring systems can produce (very) different best alignments!
- Unfortunately the number of possible alignments is pretty huge.
- Dynamic programming to the rescue.

Intuition of Dynamic Programming
Let's say we want to align XYZ and ABC. Suppose we have already computed the optimal way to:
- Align XY and AB: Opt1
- Align XY and ABC: Opt2
- Align XYZ and AB: Opt3
We now need to test three possible alignments (where "-" indicates a gap):
- Opt1 followed by Z aligned with C
- Opt2 followed by Z aligned with -
- Opt3 followed by - aligned with C

Thus, if we construct small alignments first, we are able to extend them by testing only 3 scenarios.

Formally: solving global alignment
Global Alignment Problem:
- Input: two sequences $S = s_1 \dots s_n$, $T = t_1 \dots t_m$ ($n \sim m$)
- Goal: find an optimal alignment according to the alignment quality (or scoring).

Notation:
- Let $\sigma(a,b)$ be the score (weight) of the alignment of character $a$ with character $b$.
- Let $V(i,j)$ be the optimal score of the alignment of $s_1 \dots s_i$ and $t_1 \dots t_j$ ($0 \le i \le n$, $0 \le j \le m$).

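$V(i,j)$ is computed as follows.

Base conditions (aligning zero elements scores $V(0,0) = 0$):

$$V(i,0) = \sum_{k=1}^{i} \sigma(s_k, -), \qquad V(0,j) = \sum_{k=1}^{j} \sigma(-, t_k)$$

Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:

$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i, t_j) \\ V(i-1,j) + \sigma(s_i, -) \\ V(i,j-1) + \sigma(-, t_j) \end{cases}$$

The three cases correspond to:
- aligning $s_1 \dots s_{i-1}$ with $t_1 \dots t_{j-1}$, and $s_i$ with $t_j$;
- aligning $s_1 \dots s_{i-1}$ with $t_1 \dots t_j$, and $s_i$ with $-$;
- aligning $s_1 \dots s_i$ with $t_1 \dots t_{j-1}$, and $-$ with $t_j$.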

Optimal Alignment - Tabular Computation
Use dynamic programming to compute V(i,j) for all possible i,j values:

    for i = 1 to n do
    begin
        for j = 1 to m do
        begin
            calculate V(i,j) using V(i-1,j-1), V(i,j-1), V(i-1,j)
        end
    end

Optimal Alignment - Tabular Computation
- Add back pointer(s) from cell (i,j) to the father cell(s) realizing V(i,j).
- Trace back the pointers from (n,m) to (0,0) (Needleman-Wunsch, 1970).
[Figure: backtracking the alignment]
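The table fill and traceback can be put together in a short Python sketch. The assumptions are mine, not the slides': a simple match/mismatch score stands in for the character score, every gap column costs the same constant `gap`, and instead of storing explicit back pointers the traceback recomputes which of the three cases realized each cell. The function name and the default weights are hypothetical.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment: return (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)

    def sigma(a, b):
        return match if a == b else mismatch

    # V[i][j] = best score for aligning s[:i] with t[:j].
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # base condition: s prefix against gaps
        V[i][0] = V[i - 1][0] + gap
    for j in range(1, m + 1):          # base condition: t prefix against gaps
        V[0][j] = V[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),  # s_i with t_j
                V[i - 1][j] + gap,                            # s_i with a gap
                V[i][j - 1] + gap,                            # gap with t_j
            )

    # Trace back from (n, m) to (0, 0), rebuilding one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return V[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, a, b = needleman_wunsch("ACGT", "AGT")
print(score)
print(a)
print(b)
```

When several cases tie, this traceback prefers the diagonal move, so it reports one optimal alignment out of possibly many.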

Solving Local Alignment

Algorithm of Smith and Waterman (1981).
- $V(i,j)$: the value of the optimal local alignment between $S[1..i]$ and $T[1..j]$
- Assume the weights fulfill the following condition:

$$\sigma(x,y) \begin{cases} \ge 0 & \text{if } x, y \text{ match} \\ \le 0 & \text{otherwise (mismatch or indel)} \end{cases}$$

Computing Local Alignment (2)
A scheme of the algorithm:
- Find the maximum similarity between suffixes of $s_1 \dots s_i$ and $t_1 \dots t_j$.
- Discard the prefixes of $s_1 \dots s_i$ and $t_1 \dots t_j$ whose similarity is $\le 0$ (and which therefore decrease the overall similarity).
- Find the indices $i^*$, $j^*$ of $S$ and $T$ respectively after which the similarity only decreases.

As usual, the pointers are created while filling the values in the table. The alignments are found by tracking the pointers from cell (i*, j*) until reaching an entry (i, j) that has value 0.
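A minimal Python sketch of this scheme follows. It uses the standard Smith-Waterman formulation in which each cell also has the option of scoring 0 (which is how non-positive prefixes get discarded); the function name, the match/mismatch/gap weights, and the choice to recompute the traceback direction rather than store pointers are assumptions made for illustration.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Local alignment: return (best_score, aligned_s_segment, aligned_t_segment)."""
    n, m = len(s), len(t)

    def sigma(a, b):
        return match if a == b else mismatch

    # Row 0 and column 0 stay 0: a local alignment may start anywhere.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                0,                                            # drop a non-positive prefix
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + gap,
                V[i][j - 1] + gap,
            )
            if V[i][j] > best:                                # remember (i*, j*)
                best, best_pos = V[i][j], (i, j)

    # Trace back from (i*, j*) until an entry with value 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and V[i][j] > 0:
        if V[i][j] == V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("XXACGTXX", "YYACGTYY"))
```

Only the single best segment is returned here; reporting several well-aligned segments would need repeated tracebacks from other high-scoring cells.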

Computational complexity
- Computing the table requires $O(n^2)$ operations for both global and local alignment (more precisely $O(nm)$ for sequences of lengths $n$ and $m$).
- Saving the pointers for traceback takes $O(n^2)$ space.
- But what if we are only interested in the optimal alignment score? Then we only need to remember the last row: $O(n)$ space.
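The score-only case can be sketched as follows, keeping just the previous row while the current one is filled. As before, the match/mismatch/gap weights and the function name are illustrative assumptions rather than values from the slides.

```python
def global_score_linear_space(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score using memory proportional to len(t)."""
    # Row 0 of the table: the empty prefix of s against prefixes of t.
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]  # column 0: prefix of s against gaps only
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr  # the older row is no longer needed
    return prev[-1]


print(global_score_linear_space("ACGT", "AGT"))
```

The running time is unchanged; only the memory drops, at the price of losing the information needed for traceback.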

Outline
We have now figured out:
- What an alignment is
- What an alignment score consists of
- How to efficiently compute an optimal alignment
Still left to figure out:
- Where do we obtain good $\sigma(a,b)$ values
- When do we use global/local alignment
- How to use alignment to search large databases

Scoring amino acid similarity
- Identity: count the number of identical matches, divide by the length of the aligned region. The homology rule: above 25% for amino acids, above 75% for nucleotides.
- Similarity: a less well defined measure. A problematic idea: give a positive score for aligning amino acids from the same group.
- Can we find a better definition for similarity?
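The identity measure is simple enough to write down directly. In this sketch the "length of the aligned region" is assumed to be the number of alignment columns, gap columns included; other conventions (for example, dividing by the shorter sequence length) exist and give different percentages. The function name is hypothetical.

```python
def percent_identity(top, bottom):
    """Identical columns divided by the number of alignment columns, in percent."""
    if len(top) != len(bottom) or not top:
        raise ValueError("need two non-empty aligned sequences of equal length")
    identical = sum(a == b and a != "-" for a, b in zip(top, bottom))
    return 100.0 * identical / len(top)


# The global alignment example from earlier: 9 identities over 15 columns.
print(percent_identity("ADLGAVFALCDRYFQ", "ADLGRTQN-CDRYYQ"))
```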

Scoring System based on evolution
- Some substitutions are more frequent than other substitutions.
- Chemically similar amino acids can be replaced without severely affecting the protein's function and structure.
- Orthologous proteins: proteins in different species derived from the same common ancestor.
- By comparing reasonably close orthologous proteins we can compute the relative frequencies of different amino acid changes.
- Amino acid substitution matrices: families of matrices that list the probability of change from one amino acid to another during evolution (i.e., defining identity and similarity relationships between amino acids).
- The two most popular families are the PAM and the BLOSUM matrices.

PAM matrix
- PAM units measure evolutionary distance.
- 1 PAM unit corresponds to an average of 1 accepted point mutation per 100 residues.
- Multiplying PAM1 by itself gives higher-order PAM matrices that are suitable for larger evolutionary distances.
- JTT matrices are a newer generation of PAM matrices.
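To make the "multiply PAM1 by itself" step concrete, here is a toy sketch. The 2x2 matrix below is invented purely for illustration (a two-letter alphabet where each residue changes with probability 0.01 per PAM unit); it is not real PAM data, and the helper names are hypothetical. Real PAM matrices are 20x20 but are extrapolated the same way, by raising the 1-PAM mutation probability matrix to the n-th power.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_power(matrix, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Hypothetical 1-PAM matrix: row = original residue, column = residue after 1 PAM.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Each row still sums to 1 after the multiplication, and the diagonal shrinks as n grows, reflecting that a residue is less likely to be unchanged over a larger evolutionary distance.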

PAM 1

PAM 250

Log Odds matrices
A high substitution probability might arise simply from bias in amino acid frequency, so we use the log odds of the PAM matrix.
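In the usual formulation, a log-odds score compares how often two residues are seen aligned in related sequences with how often they would be paired by chance. The symbols below are introduced here for illustration and do not appear in the slides: $M(a,b)$ is the PAM probability that $a$ is replaced by $b$, $p_a$ and $p_b$ are background amino acid frequencies, and $q_{ab} = p_a M(a,b)$ is the probability of observing the aligned pair.

$$s(a,b) = \log \frac{q_{ab}}{p_a \, p_b} = \log \frac{M(a,b)}{p_b}$$

A positive score means the pair occurs more often than expected by chance, a negative score less often. Published matrices typically scale and round these values to integers.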

(120 PAM)

Rules of thumb
- The most widely used matrix, PAM250, is good for about 20% identity between the proteins
- 40% identity --> PAM120
- 50% identity --> PAM80
- 60% identity --> PAM60

PAM vs. BLOSUM: choosing n
- Different BLOSUM matrices are derived from blocks with different identity percentages (e.g., BLOSUM62 is derived from blocks in which sequences sharing at least 62% identity are clustered and counted as one). Larger n --> smaller evolutionary distance.
- A single PAM matrix (PAM1) was constructed from a dataset with at least 85% identity. The other PAM matrices were computationally derived from it. Larger n --> larger evolutionary distance.
