
Pairwise sequence alignment

Solon P. Pissis Tomas Flouri

Heidelberg Institute for Theoretical Studies

November 17, 2012


1 Introduction
    Introduction

2 Basic definitions
    Alphabet and strings
    Distance metrics between strings
    Alignment

3 Alignment algorithms on strings
    Edit distance
    Global alignment
    Local alignment
    Substitution matrices
    Hamming distance

4 Conclusion
    Overview


Introduction


Sequence alignment is the process of comparing two or more strings of letters (e.g. nucleotides or amino acids) to infer their similarity.

Pairwise sequence alignment is the process of comparing only two strings.

It is useful in dozens of biological applications, e.g. genome assembly: taking a huge number of DNA sequences and putting them back together to create a representation of the original chromosomes from which the DNA originated.

         1  2  3  4  5  6  7  8  9
    x =  G  C  G  A  C  G  T  C  C
         |  |  |  |        |  .  |
    y =  G  C  G  A  -  -  T  A  C

Figure: Alignment between x = GCGACGTCC and y = GCGATAC: one mismatch at position 8 and a gap of length two inserted in y after position 4


We focus on online sequence alignment: the sequences cannot be preprocessed to build an index on them.

There exist four main approaches to online sequence alignment: algorithms based on dynamic programming (DP); algorithms based on automata; algorithms based on word-level parallelism; and algorithms based on filtering. We focus on algorithms based on dynamic programming.

Two distances are mainly used for comparing two strings: the edit distance (Damerau-Levenshtein distance) and the Hamming distance.

Biological applications require the modification of algorithms measuring the distance between two strings in order to perform mainly two types of sequence alignment, local and global, between nucleotide (or protein) sequences.


         1  2  3  4  5  6  7  8  9 10 11
    x =  C  G  T  C  C  G  A  A  G  T  G
               |  .  |  |  |  |
    y =  -  -  T  A  C  G  A  A  -  -  -

Table: Global alignment between x = CGTCCGAAGTG and y = TACGAA

         3  4  5  6  7  8
    x =  T  C  C  G  A  A
         |  .  |  |  |  |
    y =  T  A  C  G  A  A

Table: Local alignment between x = CGTCCGAAGTG and y = TACGAA


Alphabet and strings


Definition (Alphabet)

An alphabet Σ is a finite non-empty set whose elements are called letters.

Definition (String)

A string on an alphabet Σ is a finite, possibly empty, sequence of elements of Σ.

The zero-letter sequence is called the empty string, and is denoted by ε. The set of all possible strings on the alphabet Σ is denoted by Σ∗.

Definition (Length of string)

The length of a string x is defined as the length of the sequence associated with the string x, and is denoted by |x|.


We denote by x[i], for all 0 ≤ i < |x|, the letter at index i of x. We also call index i, for all 0 ≤ i < |x|, a position in x when x ≠ ε. It follows that the i-th letter of x is the letter at position i − 1 in x, and that

    x = x[0 . . |x| − 1]

Definition (Identity between strings)

The identity between any two strings x and y is defined as x = y if and only if |x| = |y| and x[i] = y[i], for all 0 ≤ i < |x|.


Definition (Concatenation of strings)

The concatenation of two strings x and y is the string of the letters of x followed by the letters of y. It is denoted by xy.

Definition (Factor of string)

A string x is a factor (substring) of a string y if there exist two strings u and v, such that y = uxv.

Notice that u and v are possibly empty strings!

Definition (Occurrence of string)

Let x be a non-empty string and y be a string. We say that there exists an occurrence of x in y, or, more simply, that x occurs in y, when x is a factor of y.
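The factor and occurrence definitions can be illustrated with a short sketch. The function name `occurrences` is hypothetical (it does not come from the slides); it reports every position i of y such that y = uxv with |u| = i, using the 0-based positions defined above.

```python
def occurrences(x: str, y: str) -> list[int]:
    """Return every position i of y at which the non-empty string x occurs."""
    if not x:
        raise ValueError("x must be non-empty for an occurrence to be defined")
    # x occurs at position i when y = u x v with |u| = i.
    return [i for i in range(len(y) - len(x) + 1) if y[i:i + len(x)] == x]


print(occurrences("CG", "GCGACGTCC"))  # [1, 4]
```

An empty result means that x is not a factor of y.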


Distance metrics between strings

Distance


Definition (Distance between two strings)

We say that a function δ : Σ∗ × Σ∗ → ℝ is a distance on Σ∗ if the following four properties are satisfied, for every u, v ∈ Σ∗:

Positivity: δ(u, v) ≥ 0

Separation: δ(u, v) = 0 if and only if u = v

Symmetry: δ(u, v) = δ(v, u)

Triangle inequality: δ(u, v) ≤ δ(u, w) + δ(w, v), for every w ∈ Σ∗

The distances are defined from operations that transform x into y. Three types of elementary operations are considered:

substitution (sub) of a letter of x at a given position by a letter of y

deletion (del) of a letter of x at a given position

insertion (ins) of a letter of y in x at a given position


Edit distance


We implicitly assume that the costs of edit operations are independent of the positions at which the operations are realized, and that sub(a, b) := sub(b, a) := del(a) := ins(b) := 1, for a, b ∈ Σ, a ≠ b. Replacing a letter by itself costs nothing: sub(a, a) := 0.

Definition (Edit distance)

From the elementary costs, we set

\[ \delta_E(x, y) = \min\{\text{cost of } \sigma : \sigma \in S_{x,y}\} \]

where S_{x,y} is the set of sequences of elementary edit operations that transform x into y, and the cost of an element σ ∈ S_{x,y} is the sum of the costs of the edit operations of the sequence σ. The function δ_E is then a distance on Σ∗, and it is called the edit distance (Damerau-Levenshtein distance).


Hamming distance

Definition (Hamming distance)

The Hamming distance, denoted by δ_H, is defined for two strings of the same length as the number of positions in which the two strings possess different letters. The Hamming distance is a particular case of edit distance for which only the operation of substitution is considered. This amounts to setting del(a) = ins(a) = +∞, for each a ∈ Σ.
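The definition translates almost directly into code. The following is a minimal sketch, assuming strings are represented as Python `str` values; the function name `hamming_distance` is illustrative.

```python
def hamming_distance(x: str, y: str) -> int:
    """Count the positions at which x and y possess different letters."""
    if len(x) != len(y):
        # The Hamming distance is only defined for strings of the same length.
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(x, y))


print(hamming_distance("TCCGAA", "TACGAA"))  # 1
```

Rejecting strings of different lengths mirrors the choice del(a) = ins(a) = +∞: no finite sequence of substitutions can change the length of a string.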


Alignment

Definition (Alignment between two strings)

An alignment between x and y is a string z on the alphabet of pairs of letters, more accurately on

\[ (\Sigma \cup \{\varepsilon\}) \times (\Sigma \cup \{\varepsilon\}) \setminus \{(\varepsilon, \varepsilon)\} \]

whose projection on the first component is x, and whose projection on the second component is y. Thus, if z is an alignment of length p between x and y, we have

\[ z = (x'_0, y'_0)(x'_1, y'_1) \cdots (x'_{p-1}, y'_{p-1}) \]

\[ x = x'_0 x'_1 \cdots x'_{p-1} \]

\[ y = y'_0 y'_1 \cdots y'_{p-1} \]

where x'_i ∈ Σ ∪ {ε} and y'_i ∈ Σ ∪ {ε}, for all 0 ≤ i < p.
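As a small illustration of the definition, the sketch below represents an alignment z as a list of pairs and recovers x and y by projection. The names `GAP` and `project`, and the use of "-" to stand for ε, are assumptions made for this example only.

```python
GAP = "-"  # assumed stand-in for the empty string ε inside an aligned pair


def project(z: list[tuple[str, str]]) -> tuple[str, str]:
    """Return the projections of alignment z on its first and second component."""
    for a, b in z:
        if a == GAP and b == GAP:
            # The pair (ε, ε) is excluded from the alphabet of pairs.
            raise ValueError("an alignment may not contain the pair (ε, ε)")
    x = "".join(a for a, _ in z if a != GAP)
    y = "".join(b for _, b in z if b != GAP)
    return x, y


z = list(zip("ACG--A", "ATGCTA"))
print(project(z))  # ('ACGA', 'ATGCTA')
```

Here z has length p = 6, while the projected strings have lengths 4 and 6, because concatenating ε contributes no letter.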


Example

Let the string x = ACGA and the string y = ATGCTA. An alignment between x and y is

    ACG--A
    ATGCTA

    Operation            Aligned pair   Cost
    substitute A for A   (A,A)          0
    substitute T for C   (C,T)          1
    substitute G for G   (G,G)          0
    insert C             (-,C)          1
    insert T             (-,T)          1
    substitute A for A   (A,A)          0

This alignment is optimal: its cost is 3, which equals the edit distance between the two strings.
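The cost of an alignment is the sum of the costs of its aligned pairs, which can be checked mechanically. This is a minimal sketch under the unit costs stated earlier; the function name `alignment_cost`, its parameter names, and the "-" gap symbol are illustrative choices.

```python
def alignment_cost(top: str, bottom: str, sub: int = 1, ins: int = 1,
                   dele: int = 1, gap: str = "-") -> int:
    """Sum the costs of the aligned pairs (top[k], bottom[k])."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    cost = 0
    for a, b in zip(top, bottom):
        if a == gap and b == gap:
            raise ValueError("the pair (gap, gap) is not a valid aligned pair")
        if a == gap:
            cost += ins      # a letter of y is inserted
        elif b == gap:
            cost += dele     # a letter of x is deleted
        elif a != b:
            cost += sub      # a letter of x is substituted by a different letter
        # a == b: substituting a letter by itself costs 0
    return cost


print(alignment_cost("ACG--A", "ATGCTA"))  # 3
```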


Edit distance


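We focus on algorithms based on dynamic programming (DP). Let x and y be two strings of lengths m and n, respectively. The cells of the DP matrix T[0 . . m][0 . . n], where T[i][j] is the edit distance between the prefixes x[0 . . i − 1] and y[0 . . j − 1], can be computed by the following formula, with sub(a, a) = 0:

\[
T[i][j] =
\begin{cases}
0 & i = j = 0 \\
T[i-1][j] + \mathrm{del}(x[i-1]) & 0 < i \le m,\ j = 0 \\
T[i][j-1] + \mathrm{ins}(y[j-1]) & i = 0,\ 0 < j \le n \\
\min \begin{cases}
T[i-1][j-1] + \mathrm{sub}(x[i-1], y[j-1]) \\
T[i-1][j] + \mathrm{del}(x[i-1]) \\
T[i][j-1] + \mathrm{ins}(y[j-1])
\end{cases} & 0 < i \le m,\ 0 < j \le n
\end{cases}
\]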
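The recurrence can be sketched in a few lines of Python. The function name `edit_distance_table` and the constant per-operation costs `sub`, `ins` and `dele` are assumptions for illustration; letters are read with 0-based indices, so row i of T corresponds to letter x[i − 1].

```python
def edit_distance_table(x: str, y: str, sub: int = 1, ins: int = 1,
                        dele: int = 1) -> list[list[int]]:
    """Fill the (m + 1) x (n + 1) DP matrix T for strings x and y."""
    m, n = len(x), len(y)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    # First column: delete the first i letters of x.
    for i in range(1, m + 1):
        T[i][0] = T[i - 1][0] + dele
    # First row: insert the first j letters of y.
    for j in range(1, n + 1):
        T[0][j] = T[0][j - 1] + ins
    # Inner cells: each depends on its three already-computed neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = 0 if x[i - 1] == y[j - 1] else sub
            T[i][j] = min(T[i - 1][j - 1] + s,
                          T[i - 1][j] + dele,
                          T[i][j - 1] + ins)
    return T


T = edit_distance_table("ACGA", "ATGCTA")
print(T[-1][-1])  # 3
```

The bottom-right cell T[m][n] holds the edit distance; for x = ACGA and y = ATGCTA it is 3, the cost of the alignment shown in the earlier example.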


Edit distance - Example 1


Let x = EAWACQGKL, y = ERDAWCQPGKWY, sub(a, b) := 3, ins(a) := 1, and del(a) := 1, where a, b ∈ Σ, a ≠ b, x, y ∈ Σ∗, and Σ is the amino acid alphabet.

The first row and the first column of T are initialized first; the remaining cells are then filled using the recurrence for T[i][j]:

    T    -  E  R  D  A  W  C  Q  P  G  K  W  Y
    -    0  1  2  3  4  5  6  7  8  9 10 11 12
    E    1  0  1  2  3  4  5  6  7  8  9 10 11
    A    2  1  2  3  2  3  4  5  6  7  8  9 10
    W    3  2  3  4  3  2  3  4  5  6  7  8  9
    A    4  3  4  5  4  3  4  5  6  7  8  9 10
    C    5  4  5  6  5  4  3  4  5  6  7  8  9
    Q    6  5  6  7  6  5  4  3  4  5  6  7  8
    G    7  6  7  8  7  6  5  4  5  4  5  6  7
    K    8  7  8  9  8  7  6  5  6  5  4  5  6
    L    9  8  9 10  9  8  7  6  7  6  5  6  7


Three optimal alignments, each of cost 7:

    E--AWACQ-GK--L    E--AWACQ-GK-L-    E--AWACQ-GKL--
    ERDAW-CQPGKWY-    ERDAW-CQPGKW-Y    ERDAW-CQPGK-WY
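An optimal alignment can be read off the filled matrix by walking back from T[m][n] to T[0][0] and, at each cell, moving to a neighbour that accounts for the cell's value. The sketch below returns one such alignment; the function name `optimal_alignment` and the tie-breaking order (diagonal first, then deletion, then insertion) are assumptions, so it may return any one of the optimal alignments rather than a specific one.

```python
def optimal_alignment(x: str, y: str, sub: int = 1, ins: int = 1,
                      dele: int = 1) -> tuple[str, str, int]:
    """Return one optimal alignment (two gapped rows) and its cost."""
    m, n = len(x), len(y)
    # Fill the DP matrix exactly as in the recurrence.
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        T[i][0] = T[i - 1][0] + dele
    for j in range(1, n + 1):
        T[0][j] = T[0][j - 1] + ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = 0 if x[i - 1] == y[j - 1] else sub
            T[i][j] = min(T[i - 1][j - 1] + s, T[i - 1][j] + dele, T[i][j - 1] + ins)

    # Trace back from the bottom-right cell, collecting aligned pairs in reverse.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = 0 if x[i - 1] == y[j - 1] else sub
            if T[i][j] == T[i - 1][j - 1] + s:
                top.append(x[i - 1]); bottom.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and T[i][j] == T[i - 1][j] + dele:
            top.append(x[i - 1]); bottom.append("-")   # deletion of x[i - 1]
            i -= 1
            continue
        top.append("-"); bottom.append(y[j - 1])       # insertion of y[j - 1]
        j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), T[m][n]


top, bottom, cost = optimal_alignment("EAWACQGKL", "ERDAWCQPGKWY", sub=3)
print(top)
print(bottom)
print(cost)  # 7
```

Enumerating all optimal alignments would require following every neighbour that satisfies the equality instead of only the first one.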


Edit distance - Example 2


Let x = ACGA, y = ATGCTA, sub(a, b) := 1, ins(a) := 1, and del(a) := 1, where a, b ∈ Σ, a ≠ b, x, y ∈ Σ∗, and Σ is the DNA alphabet.

Filling T with the recurrence for T[i][j] gives:

    T    -  A  T  G  C  T  A
    -    0  1  2  3  4  5  6
    A    1  0  1  2  3  4  5
    C    2  1  1  2  2  3  4
    G    3  2  2  1  2  3  4
    A    4  3  3  2  2  3  3


Two optimal alignments, each of cost 3:

    A--CGA    ACG--A
    ATGCTA    ATGCTA


Edit distance - Complexities


The first algorithm for solving this problem has been rediscovered many times in different fields (Vintsyuk, 1968; Needleman and Wunsch, 1970; Sankoff, 1972; Sellers, 1974; etc.).

The value of each cell of the table T depends only on its three neighbouring cells, so it is computed in O(1) time.

For the DP matrix T[0 . . m][0 . . n], there are m × n values computed in this way.

The initialization phase requires time O(m + n).

Hence, table T can be computed in O(m × n) time and space.

The space can be reduced: each column (or row) depends only on the previous one, so two columns (or two rows) suffice. If we are interested only in the edit distance between the strings (and not in the alignment itself), it can be computed in O(m × n) time and O(min(n, m)) space.
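The two-row idea can be sketched as follows. This is a minimal illustration, assuming unit costs for substitution, insertion and deletion; the function name `edit_distance` is hypothetical and not part of the original slides.

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance with unit costs (an assumption), keeping only two rows."""
    # Unit costs are symmetric, so we may swap the strings and let the
    # shorter one index the rows we store: O(min(m, n)) space.
    if len(y) > len(x):
        x, y = y, x
    prev = list(range(len(y) + 1))  # row 0: distance from the empty prefix of x
    for i, a in enumerate(x, 1):
        curr = [i]  # column 0: delete the first i letters of x
        for j, b in enumerate(y, 1):
            curr.append(min(
                prev[j - 1] + (a != b),  # substitution (cost 0 on a match)
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
            ))
        prev = curr  # the older row is no longer needed
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Only the distance is returned; recovering the alignment itself needs either the full table or a divide-and-conquer scheme.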

Global alignment

Needleman-Wunsch algorithm & Global alignment

Needleman and Wunsch re-formulated the edit distance problem (Damerau, 1964; Levenshtein, 1966) in terms of maximizing similarity (Needleman and Wunsch, 1970).

Sellers, however, showed in 1974 that the two problems are equivalent (Sellers, 1974).

The notion of distance between two strings is not well suited to biological applications.

Instead, we use a notion of similarity between strings in which dissimilarities are penalized and similarities are favored; i.e. sub(a, a) > 0, sub(a, b) < 0, ins(a) < 0, del(a) < 0 for a, b ∈ Σ, a ≠ b.

The dynamic-programming method that maximizes this similarity over alignments of the whole of x with the whole of y is known as the Needleman-Wunsch algorithm for global sequence alignment.
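A minimal sketch of the similarity-maximizing recurrence, under stated assumptions: the scores (match 1, mismatch −3, insertion and deletion −1) are illustrative defaults, and the function name `needleman_wunsch_score` is hypothetical.

```python
def needleman_wunsch_score(x: str, y: str,
                           match: int = 1, mismatch: int = -3,
                           gap: int = -1) -> int:
    """Global alignment score of x and y (scoring values are assumptions)."""
    m, n = len(x), len(y)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundaries: aligning a prefix with the empty string costs one gap per letter.
    for i in range(1, m + 1):
        T[i][0] = T[i - 1][0] + gap
    for j in range(1, n + 1):
        T[0][j] = T[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            T[i][j] = max(
                T[i - 1][j - 1] + sub,  # align x[i] with y[j]
                T[i - 1][j] + gap,      # delete x[i]
                T[i][j - 1] + gap,      # insert y[j]
            )
    return T[m][n]


print(needleman_wunsch_score("AWACQGK", "AWCQPGK"))
```

Compared with the edit distance, only the sign conventions change: costs become scores and the minimum becomes a maximum.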

Local alignment

Smith-Waterman algorithm & Local alignment

Instead of considering a global alignment between x and y, in molecular biology it is often more relevant to determine a best alignment between a substring of x and a substring of y.

Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.

As with global alignment, the notion of distance between two strings is not well suited here; we instead use a notion of similarity in which dissimilarity is penalized and similarity is favored.

The search for similar substrings then consists in maximizing the similarity over all pairs of substrings of x and y.

This is known as the Smith-Waterman algorithm for local sequence alignment.

Let x and y be two strings of lengths m and n, respectively. The computation of the cells of the DP matrix S[0 . . m][0 . . n] is described by the following formula:

$$
S[i][j] =
\begin{cases}
0 & i = j = 0 \\
0 & 0 < i \le m,\ j = 0 \\
0 & 0 < j \le n,\ i = 0 \\
\max\left\{
\begin{array}{l}
0 \\
S[i-1][j-1] + \mathrm{sub}(x[i], y[j]) \\
S[i-1][j] + \mathrm{del}(x[i]) \\
S[i][j-1] + \mathrm{ins}(y[j])
\end{array}
\right. & 0 < i \le m,\ 0 < j \le n
\end{cases}
$$

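Recall the formula for the global alignment (stated here as distance minimization):

$$
T[i][j] =
\begin{cases}
0 & i = j = 0 \\
T[i-1][j] + \mathrm{del}(x[i]) & 0 < i \le m,\ j = 0 \\
T[i][j-1] + \mathrm{ins}(y[j]) & 0 < j \le n,\ i = 0 \\
\min\left\{
\begin{array}{l}
T[i-1][j-1] + \mathrm{sub}(x[i], y[j]) \\
T[i-1][j] + \mathrm{del}(x[i]) \\
T[i][j-1] + \mathrm{ins}(y[j])
\end{array}
\right. & 0 < i \le m,\ 0 < j \le n
\end{cases}
$$

Compared with this, the local recurrence initializes the first row and column to 0, adds 0 as a fourth option, and takes a maximum because it works with similarity scores rather than distances.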

Smith-Waterman algorithm - Example

Let x = EAWACQGKL, y = ERDAWCQPGKWY, sub(a, a) := 1, sub(a, b) := −3, ins(a) := del(a) := −1, where a, b ∈ Σ, a ≠ b, x, y ∈ Σ∗, and Σ is the amino acid alphabet.

Filling table S with the recurrence for S[i][j] given above (rows are indexed by the letters of x, columns by the letters of y) yields:

S  -  E  R  D  A  W  C  Q  P  G  K  W  Y
-  0  0  0  0  0  0  0  0  0  0  0  0  0
E  0  1  0  0  0  0  0  0  0  0  0  0  0
A  0  0  0  0  1  0  0  0  0  0  0  0  0
W  0  0  0  0  0  2  1  0  0  0  0  1  0
A  0  0  0  0  1  1  0  0  0  0  0  0  0
C  0  0  0  0  0  0  2  1  0  0  0  0  0
Q  0  0  0  0  0  0  1  3  2  1  0  0  0
G  0  0  0  0  0  0  0  2  1  3  2  1  0
K  0  0  0  0  0  0  0  1  0  2  4  3  2
L  0  0  0  0  0  0  0  0  0  1  3  2  1

1 Locate a cell holding the largest value in table S (if several cells share it, pick any one of them).

2 Trace back from this cell, moving at each step to the neighbour cell (diagonal, upper, or left) from which the current value was derived; in this example that is always the neighbour with the largest value.

3 Stop the scan on a zero value.

In the table for x = EAWACQGKL and y = ERDAWCQPGKWY, the largest value is 4, in row K and column K. The traceback visits the values 4, 3, 2, 3, 2, 1, 2, 1 and stops on reaching 0, which gives the local alignment

AWACQ-GK

AW-CQPGK
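The fill and traceback steps can be sketched as below. This is a minimal illustration rather than a reference implementation: the function name `smith_waterman` is hypothetical, the default scores are those of the example (match 1, mismatch −3, gaps −1), and ties during traceback are resolved in the arbitrary order diagonal, up, left.

```python
def smith_waterman(x: str, y: str,
                   match: int = 1, mismatch: int = -3, gap: int = -1):
    """Return (best score, aligned part of x, aligned part of y)."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + sub,  # align x[i] with y[j]
                          S[i - 1][j] + gap,      # delete x[i]
                          S[i][j - 1] + gap)      # insert y[j]
            if S[i][j] > best:                    # step 1: remember a maximum
                best, bi, bj = S[i][j], i, j
    # Steps 2 and 3: follow the cells that produced each value until a 0.
    ax, ay = [], []
    i, j = bi, bj
    while S[i][j] > 0:
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = smith_waterman("EAWACQGKL", "ERDAWCQPGKWY")
print(score)   # 4
print(top)     # AWACQ-GK
print(bottom)  # AW-CQPGK
```

The traceback compares against the recurrence terms rather than simply picking the largest neighbour, which keeps it correct when a larger neighbour is not the actual predecessor.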

Smith-Waterman algorithm - Complexities

A naive implementation of the Smith-Waterman algorithm to compute an optimal alignment requires O(m² × n) time and O(m × n) space (Smith and Waterman, 1981).

An improved version of this algorithm requires O(m × n) time (Gotoh, 1982).

An improved version of Gotoh's algorithm requires O(max(m, n)) space (Myers and Miller, 1988).

This was inspired by Hirschberg's 1975 paper on computing longest common subsequences in linear space (Hirschberg, 1975); see also the Unix command diff, which addresses a closely related problem.

The Hirschberg algorithm is based on the divide-and-conquer principle: it divides the DP matrix into smaller parts and solves each of these parts separately. The same idea can be applied directly to DP-based global alignment algorithms.
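The divide-and-conquer principle can be illustrated on the longest common subsequence problem, the setting of Hirschberg's paper. The sketch below is a simplified illustration, not the Myers and Miller algorithm (it has no gap penalties); the names `lcs_last_row` and `hirschberg_lcs` are hypothetical.

```python
def lcs_last_row(x: str, y: str) -> list:
    """Last row of the LCS table for x versus every prefix of y (linear space)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, 1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev


def hirschberg_lcs(x: str, y: str) -> str:
    """One longest common subsequence of x and y, by divide and conquer."""
    if not x or not y:
        return ""
    if len(x) == 1:
        return x if x in y else ""
    mid = len(x) // 2
    n = len(y)
    # Scores of the top half against prefixes of y ...
    left = lcs_last_row(x[:mid], y)
    # ... and of the bottom half against suffixes of y (via reversed strings).
    right = lcs_last_row(x[mid:][::-1], y[::-1])
    # Split y where the two halves together score best.
    k = max(range(n + 1), key=lambda j: left[j] + right[n - j])
    return hirschberg_lcs(x[:mid], y[:k]) + hirschberg_lcs(x[mid:], y[k:])


print(hirschberg_lcs("AGGTAB", "GXTXAYB"))
```

Each level of the recursion stores only single rows of the table, so the working space stays linear in the string lengths while the total time remains O(m × n).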

Substitution matrices

A substitution matrix describes the rate at which one character in a sequence changes to other character states over time.

Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments.

The similarity between sequences depends on their divergence time and the substitution rates as represented in the matrix.

For example, in the process of evolution, the amino acid sequences of an organism's proteins are gradually altered from one generation to the next through the action of DNA mutations.

BLOSUM (BLOcks SUbstitution Matrix) matrices are substitution matrices used for the alignment of protein sequences.

Several BLOSUM matrices exist, built from different sets of aligned sequences and named with numbers; the number reflects the percent-identity threshold used to cluster the sequences from which the matrix was derived.

BLOSUM matrices were first introduced by Henikoff and Henikoff (Henikoff and Henikoff, 1992).

All BLOSUM matrices are based on observed (empirical) local alignments.

BLOSUM matrices with high numbers, e.g. BLOSUM80, are designed for comparing closely related sequences, while those with low numbers, e.g. BLOSUM45, are designed for comparing distantly related sequences.

Substitution matrices – BLOSUM

The BLOCKS database contains multiple alignments of conserved regions in protein families.

Henikoff and Henikoff scanned the BLOCKS database for highly conserved regions of protein families that have no gaps in the sequence alignment.

Then they counted the relative frequencies of amino acids and their substitution probabilities.

Finally, they calculated a log-odds score for each of the 210 possible pairs (including identical pairs) of the 20 standard amino acids.

Scores within a BLOSUM matrix are log-odds scores: for a pair of aligned amino acids, the score is the logarithm of the ratio between the likelihood that the two amino acids are aligned for a biological reason and the likelihood that they are aligned by chance.

Every possible match or substitution is assigned a score based on its observed frequency in the alignments of related proteins.

A positive score is given to substitutions that are observed more often than expected by chance, while a negative score is given to those observed less often.
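A small sketch of how such log-odds scores could be derived from pair counts. It assumes the half-bit scaling (2 × log2) and rounding to the nearest integer commonly described for BLOSUM; the function name `log_odds_scores` and the two-letter toy counts are hypothetical and not taken from the BLOCKS database.

```python
from math import log2


def log_odds_scores(pair_counts: dict) -> dict:
    """Map each unordered residue pair (a, b) to a rounded log-odds score.

    pair_counts holds observed counts of aligned pairs, each unordered pair
    listed once; every count must be positive in this simple sketch.
    """
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}
    # Background frequency of each residue: a mixed pair contributes half
    # of its frequency to each of its two residues.
    p = {}
    for (a, b), freq in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + freq
        else:
            p[a] = p.get(a, 0.0) + freq / 2
            p[b] = p.get(b, 0.0) + freq / 2
    scores = {}
    for (a, b), freq in q.items():
        # Frequency expected by chance for this unordered pair.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


toy_counts = {("A", "A"): 6, ("A", "B"): 2, ("B", "B"): 2}  # made-up data
print(log_odds_scores(toy_counts))
```

In this toy data the mixed pair is observed less often than chance predicts, so it receives a negative score, while both identical pairs receive positive scores.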

Substitution matrices – BLOSUM62

-   C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9  -1  -1  -3   0  -3  -3  -3  -4  -3  -3  -3  -3  -1  -1  -1  -1  -2  -2  -2
S  -1   4   1  -1   1   0   1   0   0   0  -1  -1   0  -1  -2  -2  -2  -2  -2  -3
T  -1   1   4   1  -1   1   0   1   0   0   0  -1   0  -1  -2  -2  -2  -2  -2  -3
P  -3  -1   1   7  -1  -2  -1  -1  -1  -1  -2  -2  -1  -2  -3  -3  -2  -4  -3  -4
A   0   1  -1  -1   4   0  -1  -2  -1  -1  -2  -1  -1  -1  -1  -1  -2  -2  -2  -3
G  -3   0   1  -2   0   6  -2  -1  -2  -2  -2  -2  -2  -3  -4  -4   0  -3  -3  -2
N  -3   1   0  -2  -2   0   6   1   0   0  -1   0   0  -2  -3  -3  -3  -3  -2  -4
D  -3   0   1  -1  -2  -1   1   6   2   0  -1  -2  -1  -3  -3  -4  -3  -3  -3  -4
E  -4   0   0  -1  -1  -2   0   2   5   2   0   0   1  -2  -3  -3  -3  -3  -2  -3
Q  -3   0   0  -1  -1  -2   0   0   2   5   0   1   1   0  -3  -2  -2  -3  -1  -2
H  -3  -1   0  -2  -2  -2   1   1   0   0   8   0  -1  -2  -3  -3  -2  -1   2  -2
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5   2  -1  -3  -2  -3  -3  -2  -3
K  -3   0   0  -1  -1  -2   0  -1   1   1  -1   2   5  -1  -3  -2  -3  -3  -2  -3
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5   1   2  -2   0  -1  -1
I  -1  -2  -2  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4   2   1   0  -1  -3
L  -1  -2  -2  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4   3   0  -1  -2
V  -1  -2  -2  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4  -1  -1  -3
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6   3   1
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7   2
W  -2  -3  -3  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11


Hamming distance


Let x and y be two strings of lengths m and n, respectively, and k be the maximum number of allowed errors (maximum Hamming distance).

The computation of the cells of the DP matrix D[0 . . m][0 . . n] is described by the following formula:

\[
D[i][j] =
\begin{cases}
k + 1 & \text{if } 0 < i \le m,\ j = 0 \\
0 & \text{if } i = 0,\ 0 \le j \le n \\
D[i-1][j-1] & \text{if } x[i] = y[j],\ 0 < i \le m,\ 0 < j \le n \\
D[i-1][j-1] + 1 & \text{if } x[i] \ne y[j],\ 0 < i \le m,\ 0 < j \le n
\end{cases}
\]
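The recurrence above can be sketched directly in Python. The function names are hypothetical, and reading a last-row value D[m][j] ≤ k as "x occurs in y ending at position j with at most k mismatches" is an interpretation assumed here rather than stated on the slide.

```python
def hamming_dp(x, y, k):
    """Fill the DP matrix D[0..m][0..n] following the recurrence above."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 stays all zeros
    for i in range(1, m + 1):
        D[i][0] = k + 1                         # first column: already over the threshold
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Strings are 0-indexed in Python, hence x[i-1] and y[j-1].
            D[i][j] = D[i - 1][j - 1] + (x[i - 1] != y[j - 1])
    return D

def occurrences(x, y, k):
    """End positions j (1-based) where x matches y with at most k mismatches."""
    D = hamming_dp(x, y, k)
    m = len(x)
    return [j for j in range(m, len(y) + 1) if D[m][j] <= k]

print(occurrences("ADBBCA", "ADCABCAABADBBCA", 3))  # [7, 15]
```

Each cell depends only on its upper-left neighbour, so the matrix could also be filled diagonal by diagonal or with a single row of storage; the full matrix is kept here to mirror the slides.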


Hamming distance - Example

Let x = ADBBCA, y = ADCABCAABADBBCA, and k = 3.


D  -  A  D  C  A  B  C  A  A  B  A  D  B  B  C  A
-  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
A  4  0  1  1  0  1  1  0  0  1  0  1  1  1  1  0
D  4  5  0  2  2  1  2  2  1  1  2  0  2  2  2  2
B  4  5  6  1  3  2  2  3  3  1  2  3  0  2  3  3
B  4  5  6  7  2  3  3  3  4  3  2  3  3  0  3  4
C  4  5  6  6  8  3  3  4  4  5  4  3  4  4  0  4
A  4  4  6  7  6  9  4  3  4  5  5  5  4  5  5  0


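In the last row of D, the only values not exceeding k = 3 are in columns 7 (value 3) and 15 (value 0), so x occurs in y with at most 3 mismatches ending at positions 7 and 15:

D C A B C A
. . . | | |
A D B B C A

A D B B C A
| | | | | |
A D B B C A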
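The two alignments shown above can be reproduced with a direct sliding-window check, which also serves as a cross-check of the DP matrix. This is only a sketch: the helper names are hypothetical, and positions are 1-based end positions in y.

```python
def window_mismatches(x, y, end):
    """Number of mismatches between x and the window of y ending at `end` (1-based)."""
    window = y[end - len(x):end]
    return sum(a != b for a, b in zip(window, x))

def show_window(x, y, end):
    """Render the window of y, a match ('|') / mismatch ('.') line, and x."""
    window = y[end - len(x):end]
    marks = "".join("|" if a == b else "." for a, b in zip(window, x))
    return "\n".join([" ".join(window), " ".join(marks), " ".join(x)])

x, y, k = "ADBBCA", "ADCABCAABADBBCA", 3
for end in range(len(x), len(y) + 1):
    if window_mismatches(x, y, end) <= k:
        print(f"end position {end}:")
        print(show_window(x, y, end))
        print()
```

The scan takes O(mn) time, the same order as filling the matrix, but it recomputes every window from scratch instead of reusing the diagonal predecessor.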




Overview


Pairwise sequence alignment is the process of comparing two strings of letters to infer their similarity.

There exist two main distances for comparing two strings: the edit distance and the Hamming distance.

A different formulation of the edit distance is to maximize the similarity of the two strings, known as global alignment (Needleman-Wunsch algorithm), instead of minimizing the distance between them.

Instead of considering a global alignment between two strings, in biological applications it is often more relevant to determine a best alignment between substrings of the two strings, known as local alignment (Smith-Waterman algorithm).

The Hamming distance is a particular case of the edit distance in which only the substitution operation is considered.
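To make the relationship between the two distances concrete, the following sketch computes both for a pair of equal-length strings. The function names and the unit cost for every edit operation are assumptions for illustration; the strings are chosen so that a shift makes every position mismatch.

```python
def hamming(x, y):
    """Number of mismatching positions; defined for equal-length strings only."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))

def edit_distance(x, y):
    """Unit-cost edit distance (insertion, deletion, substitution), one row at a time."""
    prev = list(range(len(y) + 1))            # distances from the empty prefix of x
    for i, a in enumerate(x, 1):
        curr = [i]                            # deleting the first i letters of x
        for j, b in enumerate(y, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (a != b)))   # match or substitution
        prev = curr
    return prev[-1]

print(hamming("ABCD", "BCDA"))        # 4: every position differs
print(edit_distance("ABCD", "BCDA"))  # 2: delete the leading A, insert A at the end
```

Because substitutions alone are always one valid way to transform a string into another of the same length, the edit distance never exceeds the Hamming distance when both are defined.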

References

N. Alachiotis, S. Berger, and A. Stamatakis. Coupling SIMD and SIMT architectures to boost performance of a phylogeny-aware alignment kernel. BMC Bioinformatics, 13:196, 2012.

F. J. Damerau. A technique for computer detection and correction of spelling errors. Commun. ACM, 7(3):171–176, 1964.

M. Farrar. Striped Smith–Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23(2):156–161, 2007.

O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3):705–708, 1982.


S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89(22):10915–10919, 1992.

D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18(6):341–343, 1975.

V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710, 1966.

W. Liu, B. Schmidt, G. Voss, and W. Müller-Wittig. Streaming Algorithms for Biological Sequence Alignment on GPUs. IEEE Trans. Parallel Distrib. Syst., 18(9):1270–1281, 2007.


S. Manavski and G. Valle. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9(S-2), 2008.

E. W. Myers and W. Miller. Optimal alignments in linear space. Computer Applications in the Biosciences, 4(1):11–17, 1988.

S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

T. Rognes. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics, 12:221, 2011.


T. Rognes and E. Seeberg. Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics, 16(8):699–706, 2000.

D. Sankoff. Matching Sequences under Deletion/Insertion Constraints. Proceedings of the National Academy of Sciences of the United States of America, 69(1):4–6, 1972.

P. H. Sellers. On the theory and computation of evolutionary distances. SIAM Journal on Applied Mathematics, 26(4):787–793, 1974.

T. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics, 4:52–57, 1968.

T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.