Transcript
Page 1: Pairwise sequence alignment

Pairwise sequence alignment

Based on a presentation by Irit Gat-Viks, which is based on a presentation by Amir Mitchel, Introduction to bioinformatics course, Bioinformatics unit, Tel Aviv University, and of Benny Shomer, Bar-Ilan University

Page 2: Pairwise sequence alignment

Where are we in the course?

• Ways to interrogate biobanks:
– By identifier-based search (GenBank etc.)
– By genome location (genome browsers)
– By mining annotation files with scripts

• Now: searching by sequence similarity

Page 3: Pairwise sequence alignment

What is it good for?

1. Function inference – if we know something about A and A is similar to B, we can say something about B – guilt by association

2. Conservation arguments – if we know that A and B do something similar, by looking at the conserved segments we can infer which parts of A and B are important for their function

3. Looking for repeats etc.

4. Identifying the position of an mRNA/any transcript in the genome

5. Resequencing

6. Etc.

Page 4: Pairwise sequence alignment

Issues with sequence similarity

• Things we’re after
– A score: how well do two sequences fit?
– Statistics: is this score significant or expected at random?
– Regions: which parts of the query and the target sequence are actually similar/different?

• Next time
– How to efficiently search a large sequence database

Page 5: Pairwise sequence alignment

Start from simple: Dot plots

• The most intuitive method to compare two sequences.
• Each dot represents an identity of two characters.
• No real score/significance, but very easy to assess visually

Page 6: Pairwise sequence alignment

To Reduce Random Noise in Dot Matrix

• Specify a window size, w

• Take w residues from each of the two sequences

• Among the w pairs of residues, count how many pairs are matches

• Specify a stringency: the minimum number of matches a window must contain for its count to be kept
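The window-and-stringency procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular dot-plot tool: the function name `dot_matrix`, anchoring each window at its first residue, and truncating windows that run past the end of a sequence are all assumptions, since the slides do not say how the window is positioned.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return a matrix of window match counts; counts below stringency become 0.

    Cell [i][j] compares the window starting at seq_a[i] with the window
    starting at seq_b[j]. Windows that run past either end are truncated
    (an assumption made for this sketch).
    """
    matrix = []
    for i in range(len(seq_a)):
        row = []
        for j in range(len(seq_b)):
            # Number of residue pairs available in this (possibly truncated) window.
            span = min(window, len(seq_a) - i, len(seq_b) - j)
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(span))
            row.append(matches if matches >= stringency else 0)
        matrix.append(row)
    return matrix
```

With `window=1, stringency=1` this reduces to the plain identity dot plot; raising both values suppresses isolated chance matches and leaves the diagonal runs.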

Page 7: Pairwise sequence alignment

Simple Dot Matrix, Window Size 1

  P V I L E P M M K V T I E M P

P 1         1                 1

V   1               1          

I     1                 1      

L       1                      

E         1               1    

P 1         1                 1

I     1                 1      

M             1 1           1  

R                              

V   1               1          

E         1               1    

V   1               1          

T                     1        

T                     1        

P 1         1                 1

Page 8: Pairwise sequence alignment

Window Size is 3

  P V I L E P M M K V T I E M P

P 3         1     1 1         1

V   3               1 1        

I     3               1 1   1 1

L       3               1 1   1

E 1       2         1     1 1  

P 1 1     1 2         1 1     1

I     1     1 1         1 1    

M             1 2           1  

R 1   1           1   1     1 1

V   1   1       1   1   1     1

E 1       1       2       1    

V   1             1 2          

T       1           1 1   1    

T     1   1           2     2 1

P 1   1 1   1         1 1   1 3

Page 9: Pairwise sequence alignment

Window Size is 3; Stringency is 2

  P V I L E P M M K V T I E M P

P 3                            

V   3                          

I     3                        

L       3                      

E         2                    

P           2                  

I                              

M               2              

R                              

V                              

E                 2            

V                   2          

T                              

T                     2     2  

P                             3

Page 10: Pairwise sequence alignment

Protein Sequences

single residue identity; 6 out of 23 identical

Page 11: Pairwise sequence alignment

Insertion/Deletion, Inversion

Page 12: Pairwise sequence alignment

ABCDEFGEFGHIJKLMNO

tandem duplication

compared to no duplication

tandem duplication

compared to self

Page 13: Pairwise sequence alignment

What Is This?

5’ GGCGG 3’

Palindrome

(Intrastrand)

Page 14: Pairwise sequence alignment

Compare a sequence with itself…

• Identifies low complexity/repeat regions

Page 15: Pairwise sequence alignment

Dotlet example

• http://myhits.isb-sib.ch/cgi-bin/dotlet

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=672

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=353120

Page 16: Pairwise sequence alignment

Definition

Alignment: A matching of two sequences. A good alignment will match many identical (or similar) characters in the two sequences.

VLSPAD-TNVK-AWAKVGAHAAGHG
||| |    |   |||| |  ||||
VLSEAEWQ-VLHVWAKVEA--AGHG

Page 17: Pairwise sequence alignment

How similar are two sequences?

• The common measure of sequence similarity is their alignment score

• Simpler measures, e.g., % identity are also common

• These require algorithms that compute the optimal alignment between sequences

Page 18: Pairwise sequence alignment

How to present the alignment?

• | - character-wise identity

• : - very similar amino acids

• . – less similar amino acids

• - – gap in one of the sequences

Page 19: Pairwise sequence alignment

Pairwise Alignment - Scoring

• The final score of the alignment is the sum of the positive scores and penalty scores:

+ Number of Identities

+ Number of Similarities

- Number of Dissimilarities

- Number of Gap openings

- Number of Gap extensions

Alignment score
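As a rough sketch of the sum above, the function below scores two already-aligned rows. The name `score_alignment`, the unit weights, and the convention that the first column of a gap run pays the opening penalty while each further column pays the extension penalty are assumptions made for illustration; real tools use substitution matrices and tuned gap penalties.

```python
def score_alignment(row_a, row_b, similar=frozenset(),
                    identity=1, similarity=1, dissimilarity=1,
                    gap_open=1, gap_extend=1):
    """Score two equal-length gapped rows by summing rewards and penalties.

    `similar` holds frozensets of residue pairs treated as similar.
    The first column of a gap run is charged gap_open; each further
    column of the same run is charged gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev_gap = None  # which row carried a gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            score -= gap_extend if gap_row == prev_gap else gap_open
            prev_gap = gap_row
            continue
        prev_gap = None
        if a == b:
            score += identity
        elif frozenset((a, b)) in similar:
            score += similarity
        else:
            score -= dissimilarity
    return score
```

With the default unit weights the result is literally identities plus similarities minus dissimilarities, gap openings and gap extensions.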

Page 20: Pairwise sequence alignment

Comparison methods

• Global alignment – Finds the best alignment across the whole two sequences.

• Local alignment – Finds regions of similarity in parts of the sequences.

Global Local

_____ _______ __ ____

__ ____ ____ __ ____

Page 21: Pairwise sequence alignment

Global Alignment

• Algorithm of Needleman and Wunsch (1970)
• Finds the alignment of two complete sequences:

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

• Semi-global alignment allows “free ends”:

GFHKKKADLGAVFALCDRYFQ
      ||||     |||| |
      ADLGRTQN-CDRYYQJKLLKJ

Page 22: Pairwise sequence alignment

Local Alignment

• Algorithm of Smith and Waterman (1981)

• Makes an optimal alignment of the best segment of similarity between two sequences.

ADLG CDRYFQ

|||| |||| |

ADLG CDRYYQ

• Can return a number of well aligned segments.

Page 23: Pairwise sequence alignment

Finding an optimal alignment

• Pairwise alignment algorithms identify the highest scoring alignment from all possible alignments.

• Different scoring systems can produce (very) different best alignments!!!

• Unfortunately, the number of possible alignments is pretty huge

• Dynamic programming to the rescue
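To get a feel for how quickly the number of alignments grows, the sketch below counts them with the same three-way case split that dynamic programming exploits. It assumes that two alignments count as different whenever their sequences of columns differ, which is one common convention; the function name `count_alignments` is made up for this example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Count alignments of a length-n sequence with a length-m sequence."""
    if n == 0 or m == 0:
        return 1  # only gaps remain, so there is a single way to finish
    # The last column is a residue pair, a gap in the second row,
    # or a gap in the first row.
    return (count_alignments(n - 1, m - 1)
            + count_alignments(n - 1, m)
            + count_alignments(n, m - 1))
```

Two sequences of length 3 already have 63 alignments under this convention, and the count grows exponentially with length, which is why enumerating all of them is not an option.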

Page 24: Pairwise sequence alignment

Intuition of Dynamic Programming

• Let’s say we want to align XYZ and ABC
• If we already computed the optimal way to:
  • Align XY and AB – Opt1
  • Align XY and ABC – Opt2
  • Align XYZ and AB – Opt3
• We now need to test three possible alignments:

Opt1 Z      Opt2 Z      Opt3 -
Opt1 C  or  Opt2 -  or  Opt3 C

(where “-” indicates a gap).

Thus, if we construct small alignments first, we are able to extend them by testing only 3 scenarios.

Page 25: Pairwise sequence alignment

Formally: solving global alignment

Global Alignment Problem:

Input: Two sequences $S = s_1 \dots s_n$, $T = t_1 \dots t_m$ ($n \sim m$)
Goal: Find an optimal alignment according to the alignment quality (or scoring).

Notation:
• Let $\sigma(a,b)$ be the score (weight) of the alignment of character $a$ with character $b$.
• Let $V(i,j)$ be the optimal score of the alignment of $S' = s_1 \dots s_i$ and $T' = t_1 \dots t_j$ ($0 \le i \le n$, $0 \le j \le m$)

Page 26: Pairwise sequence alignment

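$V(i,j)$ is computed as follows:

• Base conditions:

$$V(i,0) = \sum_{k=1}^{i} \sigma(s_k, -), \qquad V(0,j) = \sum_{k=1}^{j} \sigma(-, t_k)$$

• Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:

$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i,t_j) \\ V(i-1,j) + \sigma(s_i,-) \\ V(i,j-1) + \sigma(-,t_j) \end{cases}$$

The three cases correspond to:
– the alignment of $S' = s_1 \dots s_{i-1}$ with $T' = t_1 \dots t_{j-1}$, followed by $s_i$ with $t_j$ (no gap);
– the alignment of $S' = s_1 \dots s_{i-1}$ with $T' = t_1 \dots t_j$, followed by $s_i$ with ‘-’;
– the alignment of $S' = s_1 \dots s_i$ with $T' = t_1 \dots t_{j-1}$, followed by ‘-’ with $t_j$.

As before, $V(i,j)$ is the optimal score of the alignment of $S' = s_1 \dots s_i$ and $T' = t_1 \dots t_j$ ($0 \le i \le n$, $0 \le j \le m$).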

Page 27: Pairwise sequence alignment

Optimal Alignment - Tabular Computation

• Use dynamic programming to compute V(i,j) for all possible i,j values:

Snapshot of computing the table

Costs: match 2, mismatch/indel -1

for i = 1 to n do
begin
    for j = 1 to m do
    begin
        Calculate V(i,j) using V(i-1,j-1), V(i,j-1), V(i-1,j)
    end
end

Page 28: Pairwise sequence alignment

Optimal Alignment - Tabular Computation

• Add back pointer(s) from cell (i,j) to father cell(s) realizing V(i,j).

• Trace back the pointers from (n,m) to (0,0)

• Needleman-Wunsch, ‘70

Backtracking the alignment
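The table fill and traceback can be sketched as follows, using the costs from the earlier snapshot (match 2, mismatch/indel -1). This is a minimal illustration rather than a reference implementation: the function name is hypothetical, and instead of storing back pointers it re-derives at each cell which case produced the value, which yields one of the possibly several optimal alignments.

```python
def needleman_wunsch(s, t, match=2, mismatch=-1, indel=-1):
    """Global alignment by dynamic programming; returns (score, row_s, row_t)."""
    n, m = len(s), len(t)
    # V[i][j] = best score for aligning s[:i] with t[:j].
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + indel
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)
    # Trace back from (n, m) to (0, 0), checking which case gave V[i][j].
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        diagonal = i > 0 and j > 0
        sub = match if diagonal and s[i - 1] == t[j - 1] else mismatch
        if diagonal and V[i][j] == V[i - 1][j - 1] + sub:
            row_s.append(s[i - 1])
            row_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and V[i][j] == V[i - 1][j] + indel:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return V[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))
```

For example, `needleman_wunsch("ACGT", "AGT")` aligns the two sequences with a single gap opposite the C.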

Page 29: Pairwise sequence alignment

Solving Local Alignment

• Algorithm of Smith and Waterman (1981).
• $V(i,j)$: the value of the optimal local alignment between $S[1..i]$ and $T[1..j]$
• Assume the weights fulfill the following condition:

$$\sigma(x,y) \begin{cases} \ge 0 & \text{if } x, y \text{ match} \\ \le 0 & \text{otherwise (mismatch or indel)} \end{cases}$$

Page 30: Pairwise sequence alignment

Computing Local Alignment (2)

A scheme of the algorithm:
• Find maximum similarity between suffixes of $S' = s_1 \dots s_i$ and $T' = t_1 \dots t_j$
• Discard the prefixes $S' = s_1 \dots s_i$ and $T' = t_1 \dots t_j$ whose similarity is $\le 0$ (and which therefore decrease the overall similarity)
• Find the indices $i^*$, $j^*$ of $S$ and $T$ respectively after which the similarity only decreases.

Algorithm - Recursive Definition

Base condition: $\forall i,j:\; V(i,0) = 0,\; V(0,j) = 0$

Recursion step, for $i > 0$, $j > 0$:

$$V(i,j) = \max \begin{cases} 0 \\ V(i-1,j-1) + \sigma(s_i,t_j) \\ V(i,j-1) + \sigma(-,t_j) \\ V(i-1,j) + \sigma(s_i,-) \end{cases}$$

Compute $i^*, j^*$ such that $V(i^*,j^*) = \max_{1 \le i \le n,\; 1 \le j \le m} V(i,j)$

Page 31: Pairwise sequence alignment

• As usual, the pointers are created while filling the values in the table.
• The alignments are found by tracking the pointers from cell (i*, j*) until reaching an entry (i’, j’) that has value 0.
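A compact sketch of the local-alignment recursion and its traceback is given below. The function name and the scoring values (match 2, mismatch/indel -1) are assumptions chosen for illustration; the sketch reports only the first best-scoring cell it meets, whereas real tools may return several well-aligned segments.

```python
def smith_waterman(s, t, match=2, mismatch=-1, indel=-1):
    """Local alignment; returns (score, segment_of_s, segment_of_t) with gaps."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)
            if V[i][j] > best:
                best, best_i, best_j = V[i][j], i, j
    # Trace back from (i*, j*) until a cell with value 0 is reached.
    row_s, row_t = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            row_s.append(s[i - 1])
            row_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j] + indel:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))
```

The only differences from the global version are the extra 0 in the maximum, the zero-initialised borders, and starting the traceback at the highest-scoring cell instead of the bottom-right corner.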

Page 32: Pairwise sequence alignment

Computational complexity

• Computing the table requires O(n²) operations for both global and local alignment
• Saving the pointers for traceback – O(n²) space
• But – what if we are only interested in the optimal alignment score?
• Only need to remember the last row – O(n) space
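The last-row idea can be sketched like this for the global score. The function name and the default costs (match 2, mismatch/indel -1) are assumptions for illustration; the point is only that each row depends on nothing but the row before it.

```python
def global_score_linear_space(s, t, match=2, mismatch=-1, indel=-1):
    """Optimal global alignment score, keeping only one previous row."""
    prev = [j * indel for j in range(len(t) + 1)]  # row 0: V(0, j)
    for i in range(1, len(s) + 1):
        curr = [i * indel]  # V(i, 0)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # s_i aligned with t_j
                            prev[j] + indel,     # s_i aligned with a gap
                            curr[j - 1] + indel))  # gap aligned with t_j
        prev = curr
    return prev[-1]
```

The running time is unchanged, but the alignment itself can no longer be recovered because the cells needed for traceback have been discarded.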

Page 33: Pairwise sequence alignment

Outline

• We have now figured out
  • What an alignment is
  • What an alignment score consists of
  • How to ± efficiently compute an optimal alignment
• Still left to figure out
  • Where do we obtain good σ(i,j) values
  • When do we use global/local alignment
  • How to use alignment to search large databases

Page 34: Pairwise sequence alignment

Scoring amino acid similarity

• Identity: Count the number of identical matches, divide by length of aligned region. The homology rule: above 25% for amino acids, above 75% for nucleotides.

• Similarity: A less well defined measure

Category           Amino acids
Acids and amides   Asp (D), Glu (E), Asn (N), Gln (Q)
Basic              His (H), Lys (K), Arg (R)
Aromatic           Phe (F), Tyr (Y), Trp (W)
Hydrophilic        Ala (A), Cys (C), Gly (G), Pro (P), Ser (S), Thr (T)
Hydrophobic        Ile (I), Leu (L), Met (M), Val (V)

• A problematic idea: Give positive score for aligning amino acids from the same group

Can we find a better definition for similarity?
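The identity measure described on this slide is simple enough to sketch directly. The function name is hypothetical, and counting gap columns as part of the aligned length is an assumption; some tools divide by the number of ungapped columns or by the shorter sequence length instead.

```python
def percent_identity(row_a, row_b):
    """Identical columns divided by the aligned length, as a percentage."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    # A column counts as identical only if both rows hold the same residue.
    identical = sum(a == b and a != "-" for a, b in zip(row_a, row_b))
    return 100.0 * identical / len(row_a)
```

The input is a pair of already-aligned rows, so the value depends on the alignment that produced them as well as on the sequences.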

Page 35: Pairwise sequence alignment

Scoring System based on evolution

• Some substitutions are more frequent than other substitutions

• Chemically similar amino acids can be replaced without severely affecting the protein’s function and structure

• Orthologous proteins: proteins derived from the same common ancestor

• By comparing reasonably close orthologous proteins we can compute the relative frequencies of different amino acid changes

• Amino acid substitution matrices: Families of matrices that list the probability of change from one amino acid to another during evolution (i.e., defining identity and similarity relationships between amino acids).

• The two most popular matrices are the PAM and the BLOSUM matrix

Page 36: Pairwise sequence alignment

PAM matrix

• PAM units measure evolutionary distance.

• 1 PAM unit indicates the probability of 1 point mutation per 100 residues.

• Multiplying PAM1 by itself gives higher-numbered PAM matrices that are suitable for larger evolutionary distances.

• JTT matrices are a newer generation of PAMs
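The "multiply PAM1 by itself" step can be illustrated with a toy example. The two-letter alphabet and its probabilities below are made up for illustration and are not real PAM values; the helper names are likewise hypothetical. The real PAM1 is a 20 x 20 matrix, but the extrapolation works the same way.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(matrix, n):
    """Return the n-th power of a square matrix (n >= 1)."""
    result = matrix
    for _ in range(n - 1):
        result = mat_mul(result, matrix)
    return result


# Hypothetical two-letter alphabet: each letter stays unchanged with
# probability 0.99 per step, i.e. 1 change per 100 residues.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)
```

In the toy case the diagonal of `toy_pam250` falls to just above 0.5, showing how repeated multiplication spreads probability away from "unchanged" as evolutionary distance grows.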

Page 37: Pairwise sequence alignment

PAM 1

Page 38: Pairwise sequence alignment

PAM 250

Page 39: Pairwise sequence alignment

Log Odds matrices

• A raw substitution probability may simply reflect a bias in amino acid frequency, so we use the log odds of the PAM matrix instead.

(120 PAM)

Page 40: Pairwise sequence alignment

Rules of thumb

• The most widely used PAM250 is good for about 20% identity between the proteins

• 40% --> PAM120
• 50% --> PAM80
• 60% --> PAM60

Page 41: Pairwise sequence alignment

PAM vs. BLOSUM

• Choosing n
– Different BLOSUM matrices are derived from blocks with different identity percentages (e.g., BLOSUM62 is derived from an alignment of sequences that share at least 62% identity). Larger n ⇒ smaller evolutionary distance.
– A single PAM matrix was constructed from a dataset with at least 85% identity. Different PAM matrices were computationally derived from it. Larger n ⇒ larger evolutionary distance.

• Blosum matrices are newer (based on more sequences)

Observed % difference   Evolutionary distance (PAM)   BLOSUM
 1                        1                           99
10                       11                           90
20                       23                           80
30                       38                           70
40                       56                           60
50                       80                           50
60                      120                           40
70                      159                           30
80                      250                           20

62

120

250

Page 42: Pairwise sequence alignment

Gap parameters

• Observation: Alignments can differ significantly when using different gap parameters.

• Assumption: For each matrix there are constant default parameters that produce optimum alignments.

• Each matrix was checked with different parameters until a “true” alignment was reached.
– Where can we obtain “true” alignments?
– We can use sequence alignments based on structural alignments.
– The structural alignments are “true” for our purpose.

Page 43: Pairwise sequence alignment

DNA scoring matrices

• Uniform substitutions in all nucleotides:

From/To   A    G    C    T
A         2
G        -6    2
C        -6   -6    2
T        -6   -6   -6    2

Match: 2; mismatch: -6

Page 44: Pairwise sequence alignment

DNA scoring matrices

• The bases are divided into two groups, purines (A,G) and pyrimidines (C,T)
• Mutations are divided into transitions and transversions.
• Transitions – purine to purine or pyrimidine to pyrimidine (4 possibilities).
• Transversions – purine to pyrimidine or pyrimidine to purine (8 possibilities).
• By chance alone, transversions should occur twice as often as transitions.
• De facto, transitions are more frequent than transversions
• Bottom line: Meaningful DNA substitution matrices can be defined

Page 45: Pairwise sequence alignment

DNA scoring matrices

• Non-uniform substitutions in all nucleotides:

From/To   A    G    C    T
A         2
G        -4    2
C        -6   -6    2
T        -6   -6   -4    2

Match: 2; mismatch (transition): -4; mismatch (transversion): -6
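The non-uniform matrix can also be written as a small scoring function. The function and constant names are hypothetical; the score values are the ones shown in the matrix on this slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(x, y, match=2, transition=-4, transversion=-6):
    """Score a pair of bases, penalising transversions more than transitions."""
    if x == y:
        return match
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion
```

Setting `transition` equal to `transversion` recovers the uniform matrix from the earlier slide.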

Page 46: Pairwise sequence alignment

Evaluation

• Remember that one of our goals was to estimate the significance of the scores

• How can we estimate the significance of the alignment?

• Which alignment is better ?

A T C G C

A T - G C

A A C A A

A A - A A?

Does the score arise from order or from composition?

Page 47: Pairwise sequence alignment

Evaluation - bootstrap approach

• Data with same composition but different order:

1. Shuffle one of the sequences.

2. Re-align and score.

3. Repeat numerous times.

4. Calculate the mean and standard deviation of the shuffled alignment scores.

Page 48: Pairwise sequence alignment

Evaluation - bootstrap approach

• Data with the same composition but with a different order:

Shuffle one of the sequences

Align with the second sequence

Calculate mean and standard deviation of shuffled alignments

Compare alignment score with mean of shuffled alignments

Page 49: Pairwise sequence alignment
Page 50: Pairwise sequence alignment

Evaluation

• We can compare the score of the original alignment with the average score of the shuffled alignments.
• Rule of thumb: if original alignment score >> average score + 6*SD, then the alignment is statistically significant.
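The shuffle-and-rescore procedure and the rule of thumb can be sketched as below. Everything here is an illustrative assumption rather than what PRSS or any other tool does: the function names, the simple match 2, mismatch/indel -1 global scorer, the number of rounds, and the fixed random seed used to make runs repeatable.

```python
import random
import statistics


def global_score(s, t, match=2, mismatch=-1, indel=-1):
    """Optimal global alignment score (row-by-row dynamic programming)."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel]
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]


def shuffle_test(s, t, rounds=200, seed=0, score=global_score):
    """Return (original score, mean, SD) where mean and SD come from shuffling t."""
    rng = random.Random(seed)
    letters = list(t)
    shuffled_scores = []
    for _ in range(rounds):
        rng.shuffle(letters)  # same composition, different order
        shuffled_scores.append(score(s, "".join(letters)))
    return (score(s, t),
            statistics.mean(shuffled_scores),
            statistics.stdev(shuffled_scores))


def is_significant(original, mean, sd, k=6):
    """Rule of thumb from the slide: original > mean + k * SD."""
    return original > mean + k * sd
```

Because shuffling preserves composition, a score that stays far above the shuffled mean must come from the order of the residues, which is the distinction the evaluation slides draw between order and composition.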

Page 51: Pairwise sequence alignment

Global or local?

• Two human transcription factors:

1. SP1 factor, binds to GC rich areas.

2. EGR-1 factor, active at differentiation stage

(FASTA formats from http://us.expasy.org/sprot/)

Page 52: Pairwise sequence alignment

>sp|P08047|SP1_HUMAN Transcription factor Sp1 - Homo sapiens (Human). MSDQDHSMDEMTAVVKIEKGVGGNNGGNGNGGGAFSQARSSSTGSSSSTGGGGQESQPSP

LALLAATCSRIESPNENSNNSQGPSQSGGTGELDLTATQLSQGANGWQIISSSSGATPTS KEQSGSSTNGSNGSESSKNRTVSGGQYVVAAAPNLQNQQVLTGLPGVMPNIQYQVIPQFQ TVDGQQLQFAATGAQVQQDGSGQIQIIPGANQQIITNRGSGGNIIAAMPNLLQQAVPLQG LANNVLSGQTQYVTNVPVALNGNITLLPVNSVSAATLTPSSQAVTISSSGSQESGSQPVT SGTTISSASLVSSQASSSSFFTNANSYSTTTTTSNMGIMNFTTSGSSGTNSQGQTPQRVS GLQGSDALNIQQNQTSGGSLQAGQQKEGEQNQQTQQQQILIQPQLVQGGQALQALQAAPL SGQTFTTQAISQETLQNLQLQAVPNSGPIIIRTPTVGPNGQVSWQTLQLQNLQVQNPQAQ TITLAPMQGVSLGQTSSSNTTLTPIASAASIPAGTVTVNAAQLSSMPGLQTINLSALGTS GIQVHPIQGLPLAIANAPGDHGAQLGLHGAGGDGIHDDTAGGEEGENSPDAQPQAGRRTR REACTCPYCKDSEGRGSGDPGKKKQHICHIQGCGKVYGKTSHLRAHLRWHTGERPFMCTW SYCGKRFTRSDELQRHKRTHTGEKKFACPECPKRFMRSDHLSKHIKTHQNKKGGPGVALS VGTLPLDSGAGSEGSGTATPSALITTNMVAMEAICPEGIARLANSGINVMQVADLQSINI SGNGF

>sp|P18146|EGR1_HUMAN Early growth response protein 1 (EGR-1) (Krox-24 protein) (ZIF268) (Nerve growth factor-induced protein A) (NGFI-A) (Transcription factor ETR103) (Zinc finger protein 225) (AT225) - Homo sapiens (Human).

MAAAKAEMQLMSPLQISDPFGSFPHSPTMDNYPKLEEMMLLSNGAPQFLGAAGAPEGSGS NSSSSSSGGGGGGGGGSNSSSSSSTFNPQADTGEQPYEHLTAESFPDISLNNEKVLVETS YPSQTTRLPPITYTGRFSLEPAPNSGNTLWPEPLFSLVSGLVSMTNPPASSSSAPSPAAS SASASQSPPLSCAVPSNDSSPIYSAAPTFPTPNTDIFPEPQSQAFPGSAGTALQYPPPAY PAAKGGFQVPMIPDYLFPQQQGDLGLGTPDQKPFQGLESRTQQPSLTPLSTIKAFATQSG SQDLKALNTSYQSQLIKPSRMRKYPNRPSKTPPHERPYACPVESCDRRFSRSDELTRHIR IHTGQKPFQCRICMRNFSRSDHLTTHIRTHTGEKPFACDICGRKFARSDERKRHTKIHLR QKDKKADKSVVASSATSSLSSYPSPVATSYPSPVTTSYPSPATTSYPSPVPTSFSSPGSS TYPSPVHSGFPSPSVATTYSSVPPAFPAQVSSFPSSAVTNSFSASTGLSDMTATFSPRTI EIC

Page 53: Pairwise sequence alignment

SP1 at swissprot

Page 54: Pairwise sequence alignment

EGR1 at swissprot

Page 55: Pairwise sequence alignment

Available software…

• http://en.wikipedia.org/wiki/Sequence_alignment_software

• http://fasta.bioch.virginia.edu/fasta_www/home.html
– LAlign (local alignment), PLalign (dot plot)
– PRSS/PRFX (significance by Monte Carlo)

• http://bioportal.weizmann.ac.il/toolbox/overview.html (many useful tools), Needle, Water.

• Bl2seq (NCBI)

Page 56: Pairwise sequence alignment

Using LAlign

• http://www.ch.embnet.org/software/LALIGN_form.html

• http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NP_006758.2

• http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NP_066300.1

Page 57: Pairwise sequence alignment
Page 58: Pairwise sequence alignment

Bl2Seq at NCBI

http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi

Page 59: Pairwise sequence alignment

Bl2seq results

Page 60: Pairwise sequence alignment

Conclusions

• The proteins share only a limited area of sequence similarity. Therefore, the use of local alignment is recommended.

• We found a local alignment that pointed to a possible structural similarity, which points to a possible functional similarity.

• Reasons to make a global alignment:
  • Checking minor differences between close homologs.
  • Analyzing polymorphism.
  • A good reason

Page 61: Pairwise sequence alignment

Sequence comparisons

Goal: similarity search on sequence database

Multiple pairwise comparisons

We wish to optimize for speed, not accuracy

BLAST, FASTA programs

Next goal: refine database search, are the reported matches really interesting?

Goal: Comparing two specific sequences

Single pairwise comparisons

We wish to optimize for accuracy, not speed

Dynamic programming methods (Smith-Waterman, Needleman-Wunsch)

Identify homologs, common domains, common active sites, etc.

