pair-wise sequence alignment

Pair-wise Sequence Alignment. Introduction to bioinformatics 2007, Lecture 5. Centre for Integrative Bioinformatics VU

Upload: pippa

Post on 23-Feb-2016


TRANSCRIPT

Page 1: Pair-wise Sequence Alignment

Pair-wise Sequence Alignment

Introduction to bioinformatics 2007

Lecture 5

Centre for Integrative Bioinformatics VU

Page 2: Pair-wise Sequence Alignment

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))

“Nothing in bioinformatics makes sense except in the light of Biology”

Bioinformatics

Page 3: Pair-wise Sequence Alignment

The Genetic Code

Page 4: Pair-wise Sequence Alignment

A protein sequence alignment

MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***

A DNA sequence alignment

attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
***   ****  **** **    ******

Page 5: Pair-wise Sequence Alignment

Searching for similarities

What is the function of the new gene?

The “lazy” investigation (i.e., no biological experiments, just bioinformatics techniques):

– Find a set of similar protein sequences to the unknown sequence

– Identify similarities and differences

– For long proteins: first identify domains

Page 6: Pair-wise Sequence Alignment

Is similarity really interesting?

Page 7: Pair-wise Sequence Alignment

Evolutionary and functional relationships

Reconstruct evolutionary relation:

• Based on sequence
  – Identity (simplest method)
  – Similarity
• Homology (common ancestry: the ultimate goal)
• Other (e.g., 3D structure)

Functional relation: Sequence → Structure → Function

Page 8: Pair-wise Sequence Alignment

Common ancestry is more interesting: it makes it more likely that genes share the same function.

Homology: sharing a common ancestor
– a binary property (yes/no)
– it’s a nice tool: when (an unknown) gene X is homologous to (a known) gene G, we gain a lot of information on X: what we know about G can be transferred to X as a good suggestion.

Searching for similarities

Page 9: Pair-wise Sequence Alignment

How to go from DNA to protein sequence

A piece of double stranded DNA:

5’ attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’

DNA direction is from 5’ to 3’

Page 10: Pair-wise Sequence Alignment

How to go from DNA to protein sequence

6-frame translation using the codon table (last lecture):

5’ attcgttggcaaatcgcccctatccggc 3’

3’ taagcaaccgtttagcggggataggccg 5’
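A minimal sketch of the 6-frame translation, assuming the standard genetic code. The function names (`reverse_complement`, `translate`, `six_frame_translation`) and the frame labels `+1` … `-3` are illustrative choices, not part of the lecture. The three forward frames read the top strand from offsets 0, 1 and 2; the three reverse frames do the same on the reverse complement, i.e. the bottom strand read 5’ to 3’.

```python
# Standard genetic code, codons enumerated in the order TTT, TTC, TTA, TTG, TCT, ...
# (first, second and third base each cycle through T, C, A, G). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate consecutive codons; an incomplete trailing codon is ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[p:p + 3]] for p in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate both strands in all three reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("attcgttggcaaatcgcccctatccggc").items():
        print(name, protein)
```

Only unambiguous bases (A, C, G, T) are handled; any other symbol would raise a `KeyError` in this sketch.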

Page 11: Pair-wise Sequence Alignment

Bioinformatics tool

Data

Algorithm

Biological interpretation (model)

tool

Page 12: Pair-wise Sequence Alignment

Example today: Pairwise sequence alignment needs sense of evolution

Global dynamic programming

MDAGSTVILCFVG
MDAASTILCGS

Amino Acid Exchange Matrix

Gap penalties (open, extension)

Search matrix

MDAGSTVILCFVG-
MDAAST-ILC--GS

Evolution

Page 13: Pair-wise Sequence Alignment

How to determine similarity?

• Frequent evolutionary events:
1. Substitution
2. Insertion, deletion
3. Duplication
4. Inversion

We’ll only use these: substitution, insertion and deletion

• Evolution at work: the sequences X and Y are available; their common ancestor Z is usually extinct

Page 14: Pair-wise Sequence Alignment

Alignment - the problem

Operations:
– substitution
– insertion
– deletion

GTACT--C
GGA-TGAC

algorithmsforgenomes
||||||||||
algorithms4genomes

algorithmsforgenomes
             |||||||
  algorithms4genomes

algorithmsforgenomes
||||||||||?  |||||||
algorithms4--genomes

Page 15: Pair-wise Sequence Alignment

A protein sequence alignment

MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***

A DNA sequence alignment

attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
***   ****  **** **    ******

Page 16: Pair-wise Sequence Alignment

– Substitution (or match/mismatch)

• DNA

• proteins

– Gap penalty

• Linear: gp(k)=ak

• Affine: gp(k)=b+ak

• Concave, e.g.: gp(k)=log(k)

The score for an alignment is the sum of the scores of all alignment columns

Dynamic programming: scoring alignments

Page 17: Pair-wise Sequence Alignment

– Substitution (or match/mismatch)

• DNA

• proteins

– Gap penalty

• Linear: gp(k)=ak

• Affine: gp(k)=b+ak

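• Concave, e.g.: gp(k)=log(k)

The score for an alignment is the sum of the scores over all alignment columns

Dynamic programming: scoring alignments

[Figure: gap penalty versus gap length for the linear, affine and concave schemes]

General alignment score:

\[ S_{a,b} = \sum_{l} s(a_i, b_j) - \sum_{k} N_k \cdot gp(k) \]

Here l runs over the aligned (ungapped) columns and N_k is the number of gaps of length k.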

Page 18: Pair-wise Sequence Alignment

Substitution Matrices: DNA
Define a score for match/mismatch of letters

Simple:

     A   C   G   T
A    1  -1  -1  -1
C   -1   1  -1  -1
G   -1  -1   1  -1
T   -1  -1  -1   1

Used in genome alignments:

       A     C     G     T
A     91  -114   -31  -123
C   -114   100  -125   -31
G    -31  -125   100  -114
T   -123   -31  -114    91

Page 19: Pair-wise Sequence Alignment

Substitution matrices for a.a.

Amino acids are not equal:
1. Some are similar and easily substituted:
   • biochemical properties
   • structure
2. Some mutations occur more often due to similar codons

The two above give us substitution matrices

Figure legend: orange: nonpolar and hydrophobic; green: polar and hydrophilic; magenta box: acidic; light blue box: basic

http://www.cimr.cam.ac.uk/links/codon.htm

http://www.people.virginia.edu/~rjh9u/aminacid.html

Page 20: Pair-wise Sequence Alignment

BLOSUM 62 substitution matrix

http://www.carverlab.org/testing/epp.html

Page 21: Pair-wise Sequence Alignment

Constant vs. affine gap scoring

Seq1     G  T  A  -  -                G  -  T  -  A
Seq2     -  -  A  T  G                -  A  T  G  -

Const   -2 -2  1 -2 -2  (SUM = -7)   -2 -2  1 -2 -2  (SUM = -7)
Affine    -4   1   -4   (SUM = -7)   -3 -3  1 -3 -3  (SUM = -11)

Scoring    Gap intro   Gap extension
Constant   -2          -2
Affine     -3          -1

…and +1 for match
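The sums on this slide can be checked with a small sketch that scores a finished alignment column by column and charges each run of gaps with a gap-penalty function gp(k). The helper names are hypothetical, and the mismatch score of -1 is an assumption (the example contains no mismatches, so it does not affect the result).

```python
from itertools import groupby


def gap_run_lengths(row):
    """Lengths of the consecutive runs of '-' in one alignment row."""
    return [len(list(group)) for symbol, group in groupby(row) if symbol == "-"]


def alignment_score(row1, row2, gap_penalty, match=1, mismatch=-1):
    """Sum of column scores minus gp(k) for every gap run of length k."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    score = sum(
        match if a == b else mismatch
        for a, b in zip(row1, row2)
        if a != "-" and b != "-"
    )
    for row in (row1, row2):
        score -= sum(gap_penalty(k) for k in gap_run_lengths(row))
    return score


def constant(k):
    """Constant scheme from the slide: 2 per gap position."""
    return 2 * k


def affine(k):
    """Affine scheme from the slide: 3 to introduce a gap, 1 per extension."""
    return 3 + 1 * (k - 1)


if __name__ == "__main__":
    for rows in (("GTA--", "--ATG"), ("G-T-A", "-ATG-")):
        print(rows, alignment_score(*rows, constant), alignment_score(*rows, affine))
```

Under the constant scheme both alignments tie at -7, whereas the affine scheme prefers the alignment with fewer, longer gaps.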

Page 22: Pair-wise Sequence Alignment

Dynamic programming: scoring alignments

Amino Acid Exchange Matrix (20 × 20)

Affine gap penalties (open, extension): 10, 1

T D W V T A L K
T D W L - - I K

Score: s(T,T) + s(D,D) + s(W,W) + s(V,L) - Po - 2Px + s(L,I) + s(K,K)

Page 23: Pair-wise Sequence Alignment

Amino acid exchange matrices

How do we get one?

And how do we get associated gap penalties?

First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1968) – Atlas of Protein Sequence and Structure

(20 × 20 matrix)

Page 24: Pair-wise Sequence Alignment

A 2

R -2 6

N 0 0 2

D 0 -1 2 4

C -2 -4 -4 -5 12

Q 0 1 1 2 -5 4

E 0 -1 1 3 -5 2 4

G 1 -3 0 1 -3 -1 0 5

H -1 2 2 1 -3 3 1 -2 6

I -1 -2 -2 -2 -2 -2 -2 -3 -2 5

L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6

K -1 3 1 0 -5 1 0 -2 0 -2 -3 5

M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6

F -4 -4 -4 -6 -4 -5 -5 -5 -2 1 2 -5 0 9

P 1 0 -1 -1 -3 0 -1 -1 0 -2 -3 -1 -2 -5 6

S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2

T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3

W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17

Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10

V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4

A R N D C Q E G H I L K M F P S T W Y V

PAM250 matrix

amino acid exchange matrix (log odds)

Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to avoided mutations compared to the randomly expected situation

Page 25: Pair-wise Sequence Alignment

Amino acids are not equal:

1. Some are easily substituted because they have similar:

• physico-chemical properties

• structure

2. Some mutations between amino acids occur more often due to similar codons

The two above observations give us ways to define substitution matrices

Amino acid exchange matrices

Page 26: Pair-wise Sequence Alignment

Pair-wise alignment

Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.

\[ \binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}} \]

2 sequences of 300 a.a.: ~10^88 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!

T D W V T A L K
T D W L - - I K
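The growth of this count can be checked numerically. The sketch below evaluates the central binomial coefficient exactly and compares it with the Stirling-type estimate 2^(2n)/sqrt(πn); the function names are illustrative. Note that with this formula n = 300 evaluates to roughly 10^179 rather than the 10^88 quoted, so the quoted figure presumably rests on a different way of counting, while n = 1000 does give about 10^600.

```python
import math


def count_alignments(n):
    """Exact value of the binomial coefficient (2n choose n)."""
    return math.comb(2 * n, n)


def stirling_log10(n):
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


if __name__ == "__main__":
    for n in (10, 300, 1000):
        exact = math.log10(count_alignments(n))
        print(f"n={n}: log10(exact)={exact:.2f}, log10(estimate)={stirling_log10(n):.2f}")
```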

Page 27: Pair-wise Sequence Alignment

Technique to overcome the combinatorial explosion: Dynamic Programming

• Alignment is simulated as a Markov process; all sequence positions are seen as independent

• Chances of sequence events are independent
  – Therefore, probabilities per aligned position need to be multiplied
  – Amino acid matrices contain so-called log-odds values (log10 of the probabilities), so the values can be summed instead of multiplying the probabilities

Page 28: Pair-wise Sequence Alignment

• To perform statistical analyses on messages or sequences, we need a reference model.

• The model: each letter in a sequence is selected from a defined alphabet in an independent and identically distributed (i.i.d.) manner.

• This choice of model system will allow us to compute the statistical significance of certain characteristics of a sequence, its subsequences, or an alignment.

• Given a probability distribution, Pi, for the letters in an i.i.d. message, the probability of seeing a particular sequence of letters i, j, k, ..., n is simply Pi Pj Pk···Pn.

• As an alternative to multiplication of the probabilities, we could sum their logarithms and exponentiate the result. The probability of the same sequence of letters can be computed by exponentiating log Pi + log Pj + log Pk+ ··· + log Pn.

• In practice, when aligning sequences we only add log-odds values (residue exchange matrix) but we do not exponentiate the final score.
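A small sketch of why logarithms are summed rather than probabilities multiplied. The uniform letter distribution and the function names are assumptions made only for illustration. For short sequences both routes agree; for long ones the direct product underflows to zero in floating point, while the log sum stays well behaved.

```python
import math


def sequence_probability(seq, letter_probs):
    """Probability of an i.i.d. sequence as a direct product."""
    p = 1.0
    for letter in seq:
        p *= letter_probs[letter]
    return p


def sequence_log_probability(seq, letter_probs):
    """The same quantity in log space: a sum of log probabilities."""
    return sum(math.log(letter_probs[letter]) for letter in seq)


if __name__ == "__main__":
    uniform_dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed distribution
    short, long_seq = "GATTACA", "A" * 1000
    print(sequence_probability(short, uniform_dna))
    print(math.exp(sequence_log_probability(short, uniform_dna)))
    print(sequence_probability(long_seq, uniform_dna))      # underflows
    print(sequence_log_probability(long_seq, uniform_dna))  # still finite
```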

To say the same more statistically…

Page 29: Pair-wise Sequence Alignment

Sequence alignment: history of the dynamic programming algorithm

1970 Needleman-Wunsch global pair-wise alignment

Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol. 48(3):443-53.

1981 Smith-Waterman local pair-wise alignment

Smith, TF, Waterman, MS (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.

Page 30: Pair-wise Sequence Alignment

Pairwise sequence alignment: global dynamic programming

MDAGSTVILCFVG
MDAASTILCGS

Amino Acid Exchange Matrix

Gap penalties (open, extension)

Search matrix

MDAGSTVILCFVG-
MDAAST-ILC--GS

Evolution

Page 31: Pair-wise Sequence Alignment

Global dynamic programming

\[ H(i,j) = \max \begin{cases} H(i-1,j-1) + S(i,j) & \text{(diagonal)} \\ H(i-1,j) - g & \text{(vertical)} \\ H(i,j-1) - g & \text{(horizontal)} \end{cases} \]

S(i,j) is the value from the residue exchange matrix

This is a recursive formula

Page 32: Pair-wise Sequence Alignment

Global alignment: the Needleman-Wunsch algorithm

• Goal: find the maximal scoring alignment
• Scores: m match, -s mismatch, -g for insertion/deletion
• Dynamic programming
  – Solve smaller subproblem(s)
  – Iteratively extend the solution
• The best alignment for X[1…i] and Y[1…j] is called M[i, j]

X1 … Xi-1 - - Xi

Y1 … - Yj-1 Yj -

Page 33: Pair-wise Sequence Alignment

The algorithm

Goal: find the maximal scoring alignment
Scores: m match, -s mismatch, -g for insertion/deletion
The best alignment for X[1…i] and Y[1…j] is called M[i, j]

3 ways to extend the alignment:

X[1…i-1] X[i]
Y[1…j-1] Y[j]     M[i-1, j-1] + m or - s

X[1…i]   -
Y[1…j-1] Y[j]     M[i, j-1] - g

X[1…i-1] X[i]
Y[1…j]   -        M[i-1, j] - g

M[i, j] = the maximum of these three values

Page 34: Pair-wise Sequence Alignment

The algorithm – final equation

\[ M[i,j] = \max \begin{cases} M[i,j-1] - g \\ M[i-1,j] - g \\ M[i-1,j-1] + \mathrm{score}(X[i], Y[j]) \end{cases} \]

Corresponds to:

M[i, j-1] – g:                      X1…Xi    -
                                    Y1…Yj-1  Yj

M[i-1, j] – g:                      X1…Xi-1  Xi
                                    Y1…Yj    -

M[i-1, j-1] + score(X[i],Y[j]):     X1…Xi-1  Xi
                                    Y1…Yj-1  Yj

Page 35: Pair-wise Sequence Alignment

Example: global alignment of two sequences

• Align two DNA sequences:
  – GAGTGA
  – GAGGCGA (note the length difference)

• Parameters of the algorithm:
  – Match: score(A,A) = 1
  – Mismatch: score(A,T) = -1
  – Gap: g = 2 (each gap position scores -2)

\[ M[i,j] = \max \{\, M[i,j-1] - 2,\; M[i-1,j] - 2,\; M[i-1,j-1] \pm 1 \,\} \]

Page 36: Pair-wise Sequence Alignment

The algorithm. Step 1: init

• Create the matrix
• Initiation
  – 0 at [0,0]
  – Apply the equation…

\[ M[i,j] = \max \{\, M[i,j-1] - 2,\; M[i-1,j] - 2,\; M[i-1,j-1] \pm 1 \,\} \]

      j    0    1    2    3    4    5    6
 i         -    G    A    G    T    G    A
 0    -    0
 1    G
 2    A
 3    G
 4    G
 5    C
 6    G
 7    A

Page 37: Pair-wise Sequence Alignment

The algorithm. Step 1: init

• Initiation of the matrix:
  – 0 at pos [0,0]
  – Fill in the first row using the “→” rule
  – Fill in the first column using the “↓” rule

\[ M[i,j] = \max \{\, M[i,j-1] - 2,\; M[i-1,j] - 2,\; M[i-1,j-1] \pm 1 \,\} \]

Rows are indexed by i, columns by j:

        -    G    A    G    T    G    A
   -    0   -2   -4   -6   -8  -10  -12
   G   -2
   A   -4
   G   -6
   G   -8
   C  -10
   G  -12
   A  -14

Page 38: Pair-wise Sequence Alignment

The algorithm. Step 2: fill in

• Continue filling in of the matrix, remembering from which cell the result comes (arrows)

\[ M[i,j] = \max \{\, M[i,j-1] - 2,\; M[i-1,j] - 2,\; M[i-1,j-1] \pm 1 \,\} \]

Rows are indexed by i, columns by j:

        -    G    A    G    T    G    A
   -    0   -2   -4   -6   -8  -10  -12
   G   -2    1   -1   -3
   A   -4   -1    2
   G   -6
   G   -8
   C  -10
   G  -12
   A  -14

Page 39: Pair-wise Sequence Alignment

The algorithm. Step 2: fill in

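Rows are indexed by i, columns by j:

        -    G    A    G    T    G    A
   -    0   -2   -4   -6   -8  -10  -12
   G   -2    1   -1   -3   -5   -7   -9
   A   -4   -1    2    0   -2   -4   -6
   G   -6   -3    0    3    1   -1   -3
   G   -8   -5   -2    1    2    2    0
   C  -10   -7   -4   -1    0    1    1
   G  -12   -9   -6   -3   -2    1    0
   A  -14  -11   -8   -5   -4   -1    2

• We are done…
• Where’s the result? The lowest-rightmost cell

\[ M[i,j] = \max \{\, M[i,j-1] - 2,\; M[i-1,j] - 2,\; M[i-1,j-1] \pm 1 \,\} \]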
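The fill step can be sketched in a few lines. The function name `nw_fill` is illustrative; the defaults (match 1, mismatch -1, 2 subtracted per gap position) are the parameters of this worked example, with X = GAGGCGA along the rows (i) and Y = GAGTGA along the columns (j).

```python
def nw_fill(x, y, match=1, mismatch=-1, gap=2):
    """Fill the Needleman-Wunsch matrix M for sequences x (rows) and y (columns)."""
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    # Initiation: the first column and first row contain only gaps.
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] - gap
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] - gap
    # Fill in: each cell is the best of the diagonal, vertical and horizontal moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i - 1][j] - gap, M[i][j - 1] - gap)
    return M


if __name__ == "__main__":
    for row in nw_fill("GAGGCGA", "GAGTGA"):
        print(" ".join(f"{value:4d}" for value in row))
```

The global alignment score is the last entry of the last row.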

Page 40: Pair-wise Sequence Alignment

The algorithm. Step 3: traceback

[Filled matrix as in Step 2, with arrows marking the cell each value came from]

• Start at the last cell of the matrix

• Go against the direction of arrows

• Sometimes the value may be obtained from more than one cell (which one?)

\[ M[i,j] = \max \{\, M[i,j-1] - 2,\; M[i-1,j] - 2,\; M[i-1,j-1] \pm 1 \,\} \]

Page 41: Pair-wise Sequence Alignment

The algorithm. Step 3: traceback

• Extract the alignments

[Filled matrix as in Step 2, with the two traceback paths marked a and b]

a)
GAGT-GA
GAGGCGA

b)
GA-GTGA
GAGGCGA

The ‘low’ and the ‘high’ road

You can also jump from the high to the low road in the middle (following the arrow): to what alignment does that lead?
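A compact sketch of all three steps (init, fill in, traceback) for the example sequences. When a value can come from more than one cell, this version arbitrarily prefers the diagonal, then the vertical move; that tie-breaking rule is an assumption of the sketch and determines which of the equally good alignments is reported.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=2):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    n, m = len(x), len(y)

    def s(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    # Steps 1 and 2: initiation and fill in.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = -gap * i
    for j in range(1, m + 1):
        M[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + s(i, j), M[i - 1][j] - gap, M[i][j - 1] - gap)

    # Step 3: traceback from the lowest-rightmost cell back to [0,0].
    aligned_x, aligned_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s(i, j):
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and M[i][j] == M[i - 1][j] - gap:
            aligned_x.append(x[i - 1]); aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-"); aligned_y.append(y[j - 1])
            j -= 1
    return M[n][m], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("GAGGCGA", "GAGTGA")
    print(score)
    print(bottom)
    print(top)
```

Changing the order of the three checks in the traceback loop switches between the alternative optimal alignments.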

Page 42: Pair-wise Sequence Alignment

Affine gaps

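\[ M[i,j] = \mathrm{score}(X[i], Y[j]) + \max \begin{cases} M[i-1,j-1] \\ I_x[i-1,j-1] \\ I_y[i-1,j-1] \end{cases} \]

\[ I_x[i,j] = \max \begin{cases} M[i,j-1] + g_o \\ I_x[i,j-1] + g_e \end{cases} \]

\[ I_y[i,j] = \max \begin{cases} M[i-1,j] + g_o \\ I_y[i-1,j] + g_e \end{cases} \]

• Penalties: g_o - gap opening (e.g. -8), g_e - gap extension (e.g. -1)

M (Xi aligned with Yj):
X1… Xi
Y1… Yj

Ix (a gap in X aligned with Yj):
X1…  -
Y1…  Yj

Ix opened from M:
X1…Xi    -
Y1…Yj-1  Yj

Ix extended from Ix:
X1…  -     -
Y1…  Yj-1  Yj

• @ home: think of boundary values M[*,0], I[*,0] etc.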
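A sketch of the three-matrix recurrence that returns only the optimal score. The boundary values used here are one reasonable choice, not necessarily the intended answer to the exercise: M[0][0] = 0, a leading gap of length k costs go + (k-1)·ge, and every other boundary cell is minus infinity. The mismatch score of -1 is also an assumption.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, go=-8, ge=-1):
    """Optimal global alignment score with affine gaps (go to open, ge to extend)."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # X[i] aligned with Y[j]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # gap in X aligned with Y[j]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # X[i] aligned with gap in Y
    M[0][0] = 0
    for j in range(1, m + 1):
        Ix[0][j] = go + (j - 1) * ge
    for i in range(1, n + 1):
        Iy[i][0] = go + (i - 1) * ge
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(M[i][j - 1] + go, Ix[i][j - 1] + ge)
            Iy[i][j] = max(M[i - 1][j] + go, Iy[i - 1][j] + ge)
    return max(M[n][m], Ix[n][m], Iy[n][m])


if __name__ == "__main__":
    # One gap of length 2 is needed: 4 matches + (-8) + (-1).
    print(affine_global_score("ACGT", "ACGGGT"))
```

As in the recurrence on the slide, a gap in X cannot be followed directly by a gap in Y (or vice versa); the two gap states are only entered from M or from themselves.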

Page 43: Pair-wise Sequence Alignment

Semi-global pairwise alignment

• Global alignment: all gaps are penalised
• Semi-global alignment: N- and C-terminal (5’ and 3’) gaps (end-gaps) are not penalised

MSTGAVLIY--TS-----
---GGILLFHRTSGTSNS

End-gaps: the leading gaps in the second sequence and the trailing gaps in the first

Page 44: Pair-wise Sequence Alignment

Variation on global alignment

• Global alignment: the previous algorithm is called global alignment, because it uses all letters from both sequences.

CAGCACTTGGATTCTCGG
CAGC-----G-T----GG

• Semi-global alignment: don’t penalize for end gaps

CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

Page 45: Pair-wise Sequence Alignment

Semi-global pairwise alignment

Applications of semi-global:
– Finding a gene in genome
– Placing marker onto a chromosome
– One sequence much longer than the other

Danger: really bad alignments for divergent sequences (particularly if gap penalties are high)

Protein sequences have N- and C-terminal amino acids that are often small and hydrophilic, and so these ends match

[Figure: seq X and seq Y]

Page 46: Pair-wise Sequence Alignment

Semi-global alignment

• Ignore 5’ or N-terminal end gaps
  – First row/column set to 0
• Ignore C-terminal or 3’ end gaps
  – Read the result from last row/column

       -    G    A    G    T    G
  -    0    0    0    0    0    0
  A    0   -1    1   -1   -1   -1
  G    0    1   -2    2    0   -2
  T    0   -1    0    0    3    1
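The two changes relative to global alignment can be sketched as follows. The scoring parameters (match 1, mismatch -1, 2 subtracted per internal gap position) are assumed to be those of the earlier global example, since this slide does not state them, and the function name is illustrative.

```python
def semi_global_score(x, y, match=1, mismatch=-1, gap=2):
    """Best semi-global score of x (rows) against y (columns); end gaps are free."""
    n, m = len(x), len(y)
    # Leading end gaps are free: the first row and first column stay at 0.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i - 1][j] - gap, M[i][j - 1] - gap)
    # Trailing end gaps are free: take the best value in the last row or last column.
    best = max(max(M[n]), max(row[m] for row in M))
    return best, M


if __name__ == "__main__":
    best, matrix = semi_global_score("AGT", "GAGTG")
    print(best)
    for row in matrix:
        print(row)
```

For AGT against GAGTG the best cell lies in the last row, where the short sequence matches a substring of the longer one.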

Page 47: Pair-wise Sequence Alignment

Take-home messages
• Homology
• Why are we interested in similarity?
• Pairwise alignment: global alignment and semi-global alignment
• Three types of gap penalty regimes: linear, affine and concave
• Make sure you can calculate an alignment using the DP algorithm

Page 48: Pair-wise Sequence Alignment

• a heuristic
  – Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. Perkins, DN (1981) The Mind's Best Work

Page 49: Pair-wise Sequence Alignment

Global dynamic programming
PAM250, Gap = 6 (linear)

           S    P    E    A    R    E
      0   -6  -12  -18  -24  -30  -36
 S   -6    2   -4  -10  -16  -22  -28 | -24
 H  -12   -4    2   -3   -9  -14  -20 | -18
 A  -18  -10   -3    0   -1   -7  -13 | -12
 K  -24  -16   -9   -3   -1    2   -4 |  -6
 E  -30  -22  -15   -5   -3   -2    6 |
         -30  -24  -18  -12   -6    0

S P E A R E

S 2 1 0 1 0 0

H -1 0 1 -1 2 1

A 1 1 0 2 -2 0

K 0 -1 0 -1 3 0

E 0 -1 4 0 -1 4

These values are copied from the PAM250 matrix (see earlier slide).

The extra bottom row and rightmost column give the penalties that would need to be applied due to end gaps.

Higgs & Attwood, p. 124

Page 50: Pair-wise Sequence Alignment

Global dynamic programming
Linear, Affine or Concave gap penalties

\[ S_{i,j} = s_{i,j} + \max \begin{cases} \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1) P_x \} \\ S_{i-1,j-1} \\ \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1) P_x \} \end{cases} \]

P_i: gap opening penalty

P_x: gap extension penalty

All penalty schemes are possible because the exact length of the gap is known

Page 51: Pair-wise Sequence Alignment

Global dynamic programming
Gapo=10, Gape=2

           D    W    V    T    A    L    K
      0  -12  -14  -16  -18  -20  -22  -24
 T  -12    8   -9   -6   -5   -9  -11  -14 | -34
 D  -14    0    9    2    2    3   -5   -3 | -21
 W  -16  -13   25   11    5    4    9    0 | -16
 V  -18  -10   -4   37   21   19   19   15 |   1
 L  -20  -14   -2   23   46   31   37   26 |  14
 K  -22  -12   -9   17   33   53   39   50 |
         -34  -29   -1   17   39   27   50

D W V T A L K

T 8 3 8 11 9 9 8

D 12 1 6 8 8 4 8

W 1 25 2 3 2 6 5

V 6 2 12 8 8 10 6

L 4 6 10 9 6 14 5

K 8 5 6 8 7 5 13

These values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix)

The extra bottom row and rightmost column give the final global alignment scores

Page 52: Pair-wise Sequence Alignment

Easy DP recipe for using affine gap penalties

• M[i,j] is the score of the optimal alignment (highest scoring alignment until [i,j])

• Check
  – preceding row until j-2: apply appropriate gap penalties
  – preceding column until i-2: apply appropriate gap penalties
  – and cell[i-1, j-1]: apply score for cell[i-1, j-1]

i-1

j-1

Page 53: Pair-wise Sequence Alignment

DP is a two-step process

• Forward step: calculate scores
• Trace back: start at highest score and reconstruct the path leading to the highest score
  – These two steps lead to the highest scoring alignment (the optimal alignment)
  – This is guaranteed when you use DP!

Page 54: Pair-wise Sequence Alignment

Global dynamic programming

Page 55: Pair-wise Sequence Alignment

Semi-global dynamic programming: two examples with different gap penalties

Global score is 65 –10 – 1*2 –10 – 2*2

These values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix)

Page 56: Pair-wise Sequence Alignment

Local dynamic programming (Smith & Waterman, 1981)

LCFVMLAGSTVIVGTR
EDASTILCGS

Amino Acid Exchange Matrix

Gap penalties (open, extension)

Search matrix

Negative numbers

AGSTVIVG
A-STILCG

Page 57: Pair-wise Sequence Alignment

Local dynamic programming (Smith & Waterman, 1981)

\[ S_{i,j} = \max \begin{cases} s_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1) P_x \} \\ s_{i,j} + S_{i-1,j-1} \\ s_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1) P_x \} \\ 0 \end{cases} \]

P_i: gap opening penalty

P_x: gap extension penalty
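A minimal sketch of local alignment. To keep it short it swaps the general gap scheme of the formula above for a simple linear gap penalty (2 per gap position) and a match/mismatch score of +1/-1; both are assumptions, as are the toy DNA sequences in the usage example. The essential differences from global alignment are the extra 0 in the maximum and the traceback, which starts at the highest cell and stops at the first 0.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=2):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere, so no cell is negative.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest-scoring cell until a 0 is reached.
    aligned_x, aligned_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            aligned_x.append(x[i - 1]); aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-"); aligned_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


if __name__ == "__main__":
    print(smith_waterman("CCCGATTACACCC", "TTGATTACATT"))
```

In the usage example only the shared core of the two sequences is reported; the unrelated flanks are left out of the alignment entirely.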

Page 58: Pair-wise Sequence Alignment

Local dynamic programming

Page 59: Pair-wise Sequence Alignment

Dot plots

• Way of representing (visualising) sequence similarity without doing dynamic programming (DP)

• Make same matrix, but locally represent sequence similarity by averaging using a window

Page 60: Pair-wise Sequence Alignment

Comparing two sequences

We want to be able to choose the best alignment between two sequences.

A simple method of visualising similarities between two sequences is to use dot plots. The first sequence to be compared is assigned to the horizontal axis and the second is assigned to the vertical axis.

Page 61: Pair-wise Sequence Alignment

Dot plots can be filtered by window approaches (to calculate running averages) and applying a threshold

They can identify insertions, deletions, inversions
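A small text-based sketch of a filtered dot plot. Instead of a running average it counts the identities in a diagonal window and places a dot when the count reaches the threshold, which is equivalent to thresholding the average; the function name, the parameter defaults and the `*`/`.` rendering are illustrative choices. The first sequence runs along the horizontal axis and the second along the vertical axis.

```python
def dot_plot(seq_x, seq_y, window=1, threshold=1):
    """Return one string per seq_y position; '*' marks a window with enough identities."""
    rows = []
    for j in range(len(seq_y) - window + 1):
        row = ""
        for i in range(len(seq_x) - window + 1):
            # Count identical letters along the diagonal window starting at (i, j).
            identities = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            row += "*" if identities >= threshold else "."
        rows.append(row)
    return rows


if __name__ == "__main__":
    print("\n".join(dot_plot("GATTACAGATTACA", "GATTACA")))             # unfiltered
    print()
    print("\n".join(dot_plot("GATTACAGATTACA", "GATTACA", window=3, threshold=3)))
```

With window=1 every single identity is plotted, which is noisy; a larger window with a threshold keeps mainly the diagonals that correspond to longer similar stretches.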