Introduction Basic definitions Alignment algorithms on strings Conclusion

Pairwise sequence alignment

Solon P. Pissis Tomas Flouri

Heidelberg Institute for Theoretical Studies

November 17, 2012

1 Introduction
    Introduction

2 Basic definitions
    Alphabet and strings
    Distance metrics between strings
    Alignment

3 Alignment algorithms on strings
    Edit distance
    Global alignment
    Local alignment
    Substitution matrices
    Hamming distance

4 Conclusion
    Overview

Introduction

Sequence alignment is the process of comparing two or more strings of letters (e.g. nucleotides or amino acids) to infer their similarity.

Pairwise sequence alignment is the process of comparing only two strings.

It is useful in dozens of biological applications, e.g. genome assembly: taking a huge number of DNA sequences and putting them back together to create a representation of the original chromosomes from which the DNA originated.

    1 2 3 4 5 6 7 8 9
x = G C G A C G T C C
    | | | |     | . |
y = G C G A - - T A C

Figure: Alignment between x = GCGACGTCC and y = GCGATAC: one mismatch at position 8 and a gap of length two inserted in y after position 4.

We focus on online sequence alignment: the sequences cannot be preprocessed to build an index on them.

There are four main approaches to online sequence alignment: algorithms based on dynamic programming (DP), on automata, on word-level parallelism, and on filtering. We focus on algorithms based on dynamic programming.

Two distances are mainly used to compare two strings: the edit distance (Levenshtein distance) and the Hamming distance.

Biological applications require modifying the algorithms that measure the distance between two strings in order to perform two main types of sequence alignment, local and global, between nucleotide (or protein) sequences.

1 2 3 4 5 6 7 8 9 10 11
C G T C C G A A G T  G
    | . | | | |
- - T A C G A A - -  -

Table: Global alignment between x = CGTCCGAAGTG and y = TACGAA

3 4 5 6 7 8
T C C G A A
| . | | | |
T A C G A A

Table: Local alignment between x = CGTCCGAAGTG and y = TACGAA

Basic definitions

Alphabet and strings

Definition (Alphabet)

An alphabet Σ is a finite non-empty set whose elements are called letters.

Definition (String)

A string on an alphabet Σ is a finite, possibly empty, sequence of elements of Σ.

The zero-letter sequence is called the empty string, and is denoted by ε. The set of all possible strings on the alphabet Σ is denoted by Σ*.

Definition (Length of string)

The length of a string x is defined as the length of the sequence associated with the string x, and is denoted by |x|.

We denote by x[i], for all 0 ≤ i < |x|, the letter at index i of x. We also call index i, for all 0 ≤ i < |x|, a position in x when x ≠ ε. It follows that the ith letter of x is the letter at position i − 1 in x, and that

x = x[0 . . |x| − 1]

Definition (Identity between strings)

The identity between any two strings x and y is defined as

x = y

if and only if

|x| = |y| and x[i] = y[i], for all 0 ≤ i < |x|

Definition (Concatenation of strings)

The concatenation of two strings x and y is the string of the letters of x followed by the letters of y. It is denoted by xy.

Definition (Factor of string)

A string x is a factor (substring) of a string y if there exist two strings u and v such that y = uxv.

Notice that u and v are possibly empty strings!

Definition (Occurrence of string)

Let x be a non-empty string and y be a string. We say that there exists an occurrence of x in y, or, more simply, that x occurs in y, when x is a factor of y.

Distance metrics between strings

Distance

Definition (Distance between two strings)

We say that a function δ : Σ* × Σ* → ℝ is a distance on Σ* if the following four properties are satisfied, for every u, v ∈ Σ*:

Positivity: δ(u, v) ≥ 0

Separation: δ(u, v) = 0 if and only if u = v

Symmetry: δ(u, v) = δ(v, u)

Triangle inequality: δ(u, v) ≤ δ(u, w) + δ(w, v), for every w ∈ Σ*

The distances are defined from operations that transform x into y. Three types of elementary operations are considered:

substitution (sub) of a letter of x at a given position by a letter of y

deletion (del) of a letter of x at a given position

insertion (ins) of a letter of y in x at a given position

Edit distance

We implicitly assume that the costs of edit operations are independent of the positions at which the operations are performed, and that sub(a, b) := sub(b, a) := del(a) := ins(b) := 1, for a, b ∈ Σ, a ≠ b. Leaving a letter unchanged costs nothing, i.e. sub(a, a) := 0.

Definition (Edit distance)

From the elementary costs, we set

\[
\delta_E(x, y) = \min\{\text{cost of } \sigma : \sigma \in S_{x,y}\}
\]

where S_{x,y} is the set of sequences of elementary edit operations that transform x into y, and the cost of an element σ ∈ S_{x,y} is the sum of the costs of the edit operations of the sequence σ. The function δ_E is then a distance on Σ*, and it is called the edit distance (Levenshtein distance).

Hamming distance

Definition (Hamming distance)

The Hamming distance, denoted by δ_H, is defined for two strings of the same length as the number of positions in which the two strings have different letters. The Hamming distance is a particular case of edit distance for which only the operation of substitution is considered. This amounts to setting del(a) = ins(a) = +∞, for each a ∈ Σ.
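The definition above translates almost directly into code. The following is a minimal sketch, assuming strings are ordinary Python `str` values; the function name `hamming_distance` is hypothetical and not taken from the slides.

```python
def hamming_distance(x: str, y: str) -> int:
    """Count the positions at which x and y hold different letters."""
    if len(x) != len(y):
        # The Hamming distance is only defined for strings of equal length.
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(x, y))
```

For instance, the two aligned factors TCCGAA and TACGAA from the local alignment in the introduction differ in exactly one position, so `hamming_distance("TCCGAA", "TACGAA")` evaluates to 1.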

Alignment

Definition (Alignment between two strings)

An alignment between x and y is a string z on the alphabet of pairs of letters, more accurately on

(Σ ∪ {ε}) × (Σ ∪ {ε}) \ {(ε, ε)}

whose projection on the first component is x, and whose projection on the second component is y. Thus, if z is an alignment of length p between x and y, we have

\[
z = (x'_0, y'_0)(x'_1, y'_1) \cdots (x'_{p-1}, y'_{p-1})
\]

\[
x = x'_0 x'_1 \cdots x'_{p-1}
\]

\[
y = y'_0 y'_1 \cdots y'_{p-1}
\]

where x'_i ∈ Σ ∪ {ε} and y'_i ∈ Σ ∪ {ε}, for all 0 ≤ i < p.

Example

Let x = ACGA and y = ATGCTA. An alignment between x and y is

ACG--A
ATGCTA

Operation             Aligned pair   Cost
substitute A for A    (A,A)          0
substitute T for C    (C,T)          1
substitute G for G    (G,G)          0
insert C              (-,C)          1
insert T              (-,T)          1
substitute A for A    (A,A)          0

This alignment has cost 3. It is optimal, since the edit distance between the two strings is also 3.
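The cost column of the example can be recomputed mechanically. The sketch below assumes an alignment is given as two gapped rows of equal length, with "-" standing in for ε; the names `alignment_cost`, `projection` and `GAP` are hypothetical, and the default unit costs are those assumed for the edit distance.

```python
GAP = "-"  # assumed gap symbol, standing in for the empty string ε

def projection(row: str) -> str:
    """Recover the original string by dropping the gap symbols."""
    return row.replace(GAP, "")

def alignment_cost(row_x: str, row_y: str, sub: int = 1, ins: int = 1, dele: int = 1) -> int:
    """Sum the costs of the aligned pairs (row_x[k], row_y[k])."""
    if len(row_x) != len(row_y):
        raise ValueError("both rows of an alignment must have the same length")
    cost = 0
    for a, b in zip(row_x, row_y):
        if a == GAP and b == GAP:
            raise ValueError("the pair (ε, ε) is not allowed in an alignment")
        if a == GAP:
            cost += ins       # letter of y inserted
        elif b == GAP:
            cost += dele      # letter of x deleted
        elif a != b:
            cost += sub       # mismatch; identical letters cost 0
    return cost
```

With the rows of the example, `alignment_cost("ACG--A", "ATGCTA")` gives 3, and `projection("ACG--A")` gives back ACGA.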

Alignment algorithms on strings

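Edit distance

We focus on algorithms based on dynamic programming (DP). Let x and y be two strings of lengths m and n, respectively. The cells of the DP matrix T[0 . . m][0 . . n] can be computed by the following formula. Since strings are indexed from 0, row i corresponds to the letter x[i − 1] and column j to the letter y[j − 1]; recall that sub(a, a) = 0.

\[
T[i][j] =
\begin{cases}
0 & \text{if } i = j = 0 \\
T[i-1][j] + \mathrm{del}(x[i-1]) & \text{if } 0 < i \le m,\ j = 0 \\
T[i][j-1] + \mathrm{ins}(y[j-1]) & \text{if } i = 0,\ 0 < j \le n \\
\min \left\{
\begin{array}{l}
T[i-1][j-1] + \mathrm{sub}(x[i-1], y[j-1]) \\
T[i-1][j] + \mathrm{del}(x[i-1]) \\
T[i][j-1] + \mathrm{ins}(y[j-1])
\end{array}
\right. & \text{if } 0 < i \le m,\ 0 < j \le n
\end{cases}
\]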
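The recurrence can be filled in row by row. The following is a minimal sketch, assuming position-independent costs passed as plain integers; the function name `edit_distance_table` and its parameter names are hypothetical. Row i of the table corresponds to the letter x[i − 1] and column j to y[j − 1].

```python
def edit_distance_table(x: str, y: str, sub: int = 1, ins: int = 1, dele: int = 1):
    """Return the full DP matrix T, with T[len(x)][len(y)] the edit distance."""
    m, n = len(x), len(y)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    # First column: delete the first i letters of x.
    for i in range(1, m + 1):
        T[i][0] = T[i - 1][0] + dele
    # First row: insert the first j letters of y.
    for j in range(1, n + 1):
        T[0][j] = T[0][j - 1] + ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Identical letters cost 0, different letters cost `sub`.
            diagonal = T[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub)
            T[i][j] = min(diagonal, T[i - 1][j] + dele, T[i][j - 1] + ins)
    return T
```

The bottom-right cell holds the edit distance: `edit_distance_table("ACGA", "ATGCTA")[4][6]` is 3, in agreement with the alignment example of the previous section.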

Edit distance - Example 1

Let x = EAWACQGKL, y = ERDAWCQPGKWY, sub(a, b) := 3, ins(a) := 1, and del(a) := 1, where a, b ∈ Σ, a ≠ b, x, y ∈ Σ*, and Σ is the amino acid alphabet.

T  -  E  R  D  A  W  C  Q  P  G  K  W  Y
-  0  1  2  3  4  5  6  7  8  9 10 11 12
E  1  0  1  2  3  4  5  6  7  8  9 10 11
A  2  1  2  3  2  3  4  5  6  7  8  9 10
W  3  2  3  4  3  2  3  4  5  6  7  8  9
A  4  3  4  5  4  3  4  5  6  7  8  9 10
C  5  4  5  6  5  4  3  4  5  6  7  8  9
Q  6  5  6  7  6  5  4  3  4  5  6  7  8
G  7  6  7  8  7  6  5  4  5  4  5  6  7
K  8  7  8  9  8  7  6  5  6  5  4  5  6
L  9  8  9 10  9  8  7  6  7  6  5  6  7

Three optimal alignments, each of cost T[9][12] = 7:

E--AWACQ-GK--L
ERDAW-CQPGKWY-

E--AWACQ-GK-L-
ERDAW-CQPGKW-Y

E--AWACQ-GKL--
ERDAW-CQPGK-WY

Edit distance - Example 2

Let x = ACGA, y = ATGCTA, sub(a, b) := 1, ins(a) := 1, and del(a) := 1, where a, b ∈ Σ, a ≠ b, x, y ∈ Σ*, and Σ is the DNA alphabet.

T  -  A  T  G  C  T  A
-  0  1  2  3  4  5  6
A  1  0  1  2  3  4  5
C  2  1  1  2  2  3  4
G  3  2  2  1  2  3  4
A  4  3  3  2  2  3  3

Two optimal alignments, each of cost T[4][6] = 3:

A--CGA
ATGCTA

ACG--A
ATGCTA
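An optimal alignment such as those shown in the examples can be read off the table by walking back from the bottom-right cell. The sketch below is one possible way to do it, not necessarily the procedure used to produce the slides: it recomputes the table, then at each cell moves to a neighbour that explains the current value, preferring the diagonal. The name `edit_alignment` and the tie-breaking order are assumptions, so when several optimal alignments exist it returns only one of them.

```python
def edit_alignment(x: str, y: str, sub: int = 1, ins: int = 1, dele: int = 1):
    """Return (distance, gapped_x, gapped_y) for one optimal alignment."""
    m, n = len(x), len(y)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        T[i][0] = T[i - 1][0] + dele
    for j in range(1, n + 1):
        T[0][j] = T[0][j - 1] + ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = T[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub)
            T[i][j] = min(diagonal, T[i - 1][j] + dele, T[i][j - 1] + ins)

    # Traceback from T[m][n] to T[0][0], collecting aligned pairs in reverse.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        pair_cost = 0 if i > 0 and j > 0 and x[i - 1] == y[j - 1] else sub
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + pair_cost:
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and T[i][j] == T[i - 1][j] + dele:
            top.append(x[i - 1]); bottom.append("-")   # deletion of x[i-1]
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])   # insertion of y[j-1]
            j -= 1
    return T[m][n], "".join(reversed(top)), "".join(reversed(bottom))
```

Removing the gap symbols from the two returned rows gives back x and y, which is a convenient sanity check on any traceback.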

Edit distance - Complexities

The first algorithm for solving this problem has been rediscovered many times in different fields (Vintsyuk, 1968; Needleman and Wunsch, 1970; Sankoff, 1972; Sellers, 1974; etc.).

The value of each cell of the table T depends only on its three neighbour cells, so it is computed in O(1) time.

For the DP matrix T[0 . . m][0 . . n], there are m × n values computed in this way.

The initialization phase requires O(m + n) time.

Hence, table T can be computed in O(m × n) time and space.

The space can be reduced: each row (or column) depends only on the previous one, so storing two rows (or two columns) suffices. If we are only interested in the edit distance between the strings (but not the alignment!), it can be computed in O(m × n) time and O(min(n, m)) space.
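The two-row idea can be sketched as follows. This is an illustrative implementation under the same cost assumptions as before; the name `edit_distance` is hypothetical. To reach O(min(n, m)) space the rows are laid out along the shorter string, which means swapping the two strings and, with them, the roles of insertion and deletion.

```python
def edit_distance(x: str, y: str, sub: int = 1, ins: int = 1, dele: int = 1) -> int:
    """Edit distance in O(len(x) * len(y)) time and O(min(len(x), len(y))) space."""
    if len(y) > len(x):
        # Transforming y into x uses the mirrored operations of x into y.
        x, y, ins, dele = y, x, dele, ins
    previous = [j * ins for j in range(len(y) + 1)]  # row 0 of the table
    for i in range(1, len(x) + 1):
        current = [previous[0] + dele]
        for j in range(1, len(y) + 1):
            diagonal = previous[j - 1] + (0 if x[i - 1] == y[j - 1] else sub)
            current.append(min(diagonal, previous[j] + dele, current[j - 1] + ins))
        previous = current  # only the last completed row is kept
    return previous[-1]
```

The price of the saving is that the table is no longer available for a traceback, which is why this variant yields the distance but not the alignment.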

Global alignment

Needleman-Wunsch algorithm & Global alignment

Needleman and Wunsch re-formulated the edit distance problem (Damerau, 1964; Levenshtein, 1966) in terms of maximizing similarity (Needleman and Wunsch, 1970).

Sellers showed in 1974 that the two problems are equivalent (Sellers, 1974).

The notion of distance between two strings is not suitable for biological applications.

We instead use a notion of similarity between strings, in which dissimilarities are penalized and similarities are favored; i.e. sub(a, a) > 0, sub(a, b) < 0, ins(a) < 0, del(a) < 0 for a, b ∈ Σ, a ≠ b.

Computing an alignment of the whole of x with the whole of y that maximizes this similarity is known as the Needleman-Wunsch algorithm for global sequence alignment.
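In code, the switch from distance to similarity amounts to signed scores and a max in place of the min. The following is a minimal sketch that computes only the optimal global score; the function name `needleman_wunsch_score` and the default scores (match 1, mismatch −1, gap −1) are assumptions chosen for illustration and merely satisfy the sign conditions above.

```python
def needleman_wunsch_score(x: str, y: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best global similarity score between x and y under linear gap penalties."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning a prefix with the empty string costs one gap per letter.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = S[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            S[i][j] = max(diagonal, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S[m][n]
```

Two identical strings score their length times `match`, while aligning a string with the empty string scores its length times `gap`.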

Local alignment

Smith-Waterman algorithm & Local alignment

Instead of considering a global alignment between x and y, in molecular biology it is often more relevant to determine a best alignment between a substring of x and a substring of y.

Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.

As for global alignment, the notion of distance between two strings is not suitable for biological applications. We instead use a notion of similarity between strings, in which dissimilarity is penalized and similarity is favored.

The search for similar substrings then consists in maximizing the similarity between a substring of x and a substring of y.

The DP algorithm that solves this problem is known as the Smith-Waterman algorithm for local sequence alignment.

Let x and y be two strings of lengths m and n, respectively. The computation of the cells of the DP matrix S[0 . . m][0 . . n] is described by the following formula, where row i corresponds to the letter x[i − 1] and column j to the letter y[j − 1]:

\[
S[i][j] =
\begin{cases}
0 & \text{if } i = j = 0 \\
0 & \text{if } 0 < i \le m,\ j = 0 \\
0 & \text{if } i = 0,\ 0 < j \le n \\
\max \left\{
\begin{array}{l}
0 \\
S[i-1][j-1] + \mathrm{sub}(x[i-1], y[j-1]) \\
S[i-1][j] + \mathrm{del}(x[i-1]) \\
S[i][j-1] + \mathrm{ins}(y[j-1])
\end{array}
\right. & \text{if } 0 < i \le m,\ 0 < j \le n
\end{cases}
\]
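The recurrence for S can be sketched in a few lines. The function names `smith_waterman_table` and `best_cell` are hypothetical, and the default scores (match 1, mismatch −3, gap −1) are an assumption borrowed from the worked example in these slides.

```python
def smith_waterman_table(x: str, y: str, match: int = 1, mismatch: int = -3, gap: int = -1):
    """Return the local-alignment score matrix S for x and y."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay at zero: a local alignment may start anywhere.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = S[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            # The extra 0 lets a negative-scoring prefix be discarded.
            S[i][j] = max(0, diagonal, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S

def best_cell(S):
    """Return (value, i, j) for a cell of S holding the largest value."""
    return max((value, i, j) for i, row in enumerate(S) for j, value in enumerate(row))
```

The largest value in S is the score of a best local alignment, and the cell that holds it marks where that alignment ends.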

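Recall the formula for the edit distance, on which global alignment is based:

\[
T[i][j] =
\begin{cases}
0 & \text{if } i = j = 0 \\
T[i-1][j] + \mathrm{del}(x[i-1]) & \text{if } 0 < i \le m,\ j = 0 \\
T[i][j-1] + \mathrm{ins}(y[j-1]) & \text{if } i = 0,\ 0 < j \le n \\
\min \left\{
\begin{array}{l}
T[i-1][j-1] + \mathrm{sub}(x[i-1], y[j-1]) \\
T[i-1][j] + \mathrm{del}(x[i-1]) \\
T[i][j-1] + \mathrm{ins}(y[j-1])
\end{array}
\right. & \text{if } 0 < i \le m,\ 0 < j \le n
\end{cases}
\]

The local alignment formula differs in three ways: max replaces min, the first row and the first column are set to 0, and 0 is added as a fourth option so that an alignment can start at any cell.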

Local alignment

Smith-Waterman algorithm - Example

Let x = EAWACQGKL, y = ERDAWCQPGKWY, sub(a, a) := 1, sub(a, b) := −3, ins(a) := del(a) := −1, where a, b ∈ Σ, a ≠ b, x, y ∈ Σ∗, and Σ is the amino acid alphabet.

S  -  E  R  D  A  W  C  Q  P  G  K  W  Y
-  0  0  0  0  0  0  0  0  0  0  0  0  0
E  0  1  0  0  0  0  0  0  0  0  0  0  0
A  0  0  0  0  1  0  0  0  0  0  0  0  0
W  0  0  0  0  0  2  1  0  0  0  0  1  0
A  0  0  0  0  1  1  0  0  0  0  0  0  0
C  0  0  0  0  0  0  2  1  0  0  0  0  0
Q  0  0  0  0  0  0  1  3  2  1  0  0  0
G  0  0  0  0  0  0  0  2  1  3  2  1  0
K  0  0  0  0  0  0  0  1  0  2  4  3  2
L  0  0  0  0  0  0  0  0  0  1  3  2  1
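The table above can be reproduced with a short Python sketch of the recurrence for S. The function name `smith_waterman_matrix` and its keyword parameters are hypothetical; the scores (match 1, mismatch −3, gap −1) are the ones given in the example.

```python
def smith_waterman_matrix(x, y, match=1, mismatch=-3, gap=-1):
    """Fill the local alignment table S for strings x and y."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay at zero, as in the recurrence.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                0,                      # restart the alignment here
                S[i - 1][j - 1] + sub,  # match or substitution
                S[i - 1][j] + gap,      # deletion of x[i]
                S[i][j - 1] + gap,      # insertion of y[j]
            )
    return S


x, y = "EAWACQGKL", "ERDAWCQPGKWY"
S = smith_waterman_matrix(x, y)
print("S", "-", *y)
for label, row in zip("-" + x, S):
    print(label, *row)
```

Python strings are 0-indexed, so `x[i - 1]` plays the role of x[i] in the formula.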

1 Locate a cell holding the largest value in table S (if several cells tie, pick any one of them).

2 Trace back from this cell, moving at each step to the neighboring cell (diagonal, up, or left) from which the current value was obtained.

3 Stop the traceback on reaching a zero value.

Resulting local alignment (score 4):

AWACQ-GK
AW-CQPGK
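The three traceback steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `smith_waterman` is hypothetical, ties between predecessor cells are broken in the fixed order diagonal, up, left (an assumption), and only the first cell holding the maximum is used as the starting point.

```python
def smith_waterman(x, y, match=1, mismatch=-3, gap=-1):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0, S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap, S[i][j - 1] + gap)
            if S[i][j] > best:          # step 1: remember a largest cell
                best, bi, bj = S[i][j], i, j

    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and S[i][j] > 0:   # step 3: stop on a zero
        sub = match if x[i - 1] == y[j - 1] else mismatch
        # Step 2: move to the cell the current value was derived from.
        if S[i][j] == S[i - 1][j - 1] + sub:
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, ax, ay = smith_waterman("EAWACQGKL", "ERDAWCQPGKWY")
print(score)
print(ax)
print(ay)
```

On the example strings this recovers the alignment AWACQ-GK over AW-CQPGK with score 4.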

Smith-Waterman algorithm - Complexities

A naive implementation of the Smith-Waterman algorithm to compute an optimal alignment requires O(m² × n) time and O(m × n) space (Smith and Waterman, 1981).

An improved version of this algorithm requires O(m × n) time (Gotoh, 1982).

An improved version of Gotoh’s algorithm requires O(max(m, n)) space (Myers and Miller, 1988).

This was inspired by Hirschberg’s 1975 paper on computing longest common subsequences in linear space (Hirschberg, 1975); see also the Unix command diff.

The Hirschberg algorithm is based on the divide-and-conquer principle: it divides the DP matrix into smaller parts and solves each of these parts separately. The same idea can be applied directly to DP-based global alignment algorithms.
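The starting observation behind the linear-space results is that each row of the DP matrix depends only on the row above it. The Python sketch below keeps two rows and returns only the best local score. It is not the Myers and Miller or Hirschberg algorithm: recovering the alignment itself in linear space needs the divide-and-conquer step described above. The function name `smith_waterman_score` is hypothetical, and the swap of x and y assumes that insertions and deletions cost the same.

```python
def smith_waterman_score(x, y, match=1, mismatch=-3, gap=-1):
    """Best local alignment score using two rows of the DP matrix."""
    if len(y) > len(x):
        # Rows are indexed by the shorter string, so space is O(min(m, n)).
        x, y = y, x
    prev = [0] * (len(y) + 1)
    best = 0
    for a in x:
        curr = [0]                       # column 0 is always zero
        for j, b in enumerate(y, start=1):
            sub = match if a == b else mismatch
            curr.append(max(0,
                            prev[j - 1] + sub,   # diagonal
                            prev[j] + gap,       # from the row above
                            curr[j - 1] + gap))  # from the left
        best = max(best, max(curr))
        prev = curr                      # the older row is discarded
    return best


print(smith_waterman_score("EAWACQGKL", "ERDAWCQPGKWY"))
```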

Substitution matrices

A substitution matrix describes the rate at which one character in a sequence changes to other character states over time.

Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments.

The similarity between sequences depends on their divergence time and the substitution rates as represented in the matrix.

For example, in the process of evolution, from one generation to the next the amino acid sequences of an organism’s proteins are gradually altered through the action of DNA mutations.

The BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for aligning protein sequences.

Several sets of BLOSUM matrices exist, built from different alignment databases and named with numbers.

BLOSUM matrices were first introduced by Henikoff and Henikoff (Henikoff and Henikoff, 1992).

All BLOSUM matrices are based on observed (empirical) local alignments.

BLOSUM matrices with high numbers, e.g. BLOSUM80, are designed for comparing closely related sequences, while those with low numbers, e.g. BLOSUM45, are designed for comparing distantly related sequences.

Substitution matrices – BLOSUM

The BLOCKS database contains multiple alignments of conserved regions in protein families.

Henikoff and Henikoff scanned the BLOCKS database for highly conserved regions of protein families that have no gaps in the sequence alignment.

They then counted the relative frequencies of amino acids and their substitution probabilities.

Finally, they calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids (190 pairs of distinct amino acids plus 20 identical pairs).

Scores within a BLOSUM matrix are log-odds scores: for each pair of amino acids, the score is the logarithm of the ratio between the likelihood of the two amino acids being aligned for a biological reason and the likelihood of them being aligned by chance.

Every possible match or substitution is assigned a score based on its observed frequency in the alignment of related proteins.

A positive score is given to the more likely substitutions, while a negative score is given to the less likely substitutions.
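The log-odds idea can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the function name `log_odds_score`, its parameters `observed` (frequency of the pair in alignments of related proteins) and `expected` (frequency of the pair by chance), the base-2 logarithm, the scale factor of 2, and the rounding to the nearest integer are all choices made for this example, and the frequencies in the usage lines are made up.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Scaled, rounded log-odds score of an aligned amino acid pair.

    observed: frequency of the pair in alignments of related proteins.
    expected: frequency of the pair if residues were paired by chance.
    scale:    multiplier applied to the base-2 logarithm (assumed here).
    """
    return round(scale * math.log2(observed / expected))


# Hypothetical frequencies, for illustration only.
print(log_odds_score(0.02, 0.01))   # pair seen twice as often as by chance
print(log_odds_score(0.005, 0.01))  # pair seen half as often as by chance
print(log_odds_score(0.01, 0.01))   # pair seen exactly as often as by chance
```

A pair observed more often than chance predicts gets a positive score, a pair observed less often gets a negative one, and a pair observed exactly as often as chance predicts scores zero.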

Substitution matrices – BLOSUM62

-    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C    9  -1  -1  -3   0  -3  -3  -3  -4  -3  -3  -3  -3  -1  -1  -1  -1  -2  -2  -2
S   -1   4   1  -1   1   0   1   0   0   0  -1  -1   0  -1  -2  -2  -2  -2  -2  -3
T   -1   1   4   1  -1   1   0   1   0   0   0  -1   0  -1  -2  -2  -2  -2  -2  -3
P   -3  -1   1   7  -1  -2  -1  -1  -1  -1  -2  -2  -1  -2  -3  -3  -2  -4  -3  -4
A    0   1  -1  -1   4   0  -1  -2  -1  -1  -2  -1  -1  -1  -1  -1  -2  -2  -2  -3
G   -3   0   1  -2   0   6  -2  -1  -2  -2  -2  -2  -2  -3  -4  -4   0  -3  -3  -2
N   -3   1   0  -2  -2   0   6   1   0   0  -1   0   0  -2  -3  -3  -3  -3  -2  -4
D   -3   0   1  -1  -2  -1   1   6   2   0  -1  -2  -1  -3  -3  -4  -3  -3  -3  -4
E   -4   0   0  -1  -1  -2   0   2   5   2   0   0   1  -2  -3  -3  -3  -3  -2  -3
Q   -3   0   0  -1  -1  -2   0   0   2   5   0   1   1   0  -3  -2  -2  -3  -1  -2
H   -3  -1   0  -2  -2  -2   1   1   0   0   8   0  -1  -2  -3  -3  -2  -1   2  -2
R   -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5   2  -1  -3  -2  -3  -3  -2  -3
K   -3   0   0  -1  -1  -2   0  -1   1   1  -1   2   5  -1  -3  -2  -3  -3  -2  -3
M   -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5   1   2  -2   0  -1  -1
I   -1  -2  -2  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4   2   1   0  -1  -3
L   -1  -2  -2  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4   3   0  -1  -2
V   -1  -2  -2  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4  -1  -1  -3
F   -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6   3   1
Y   -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7   2
W   -2  -3  -3  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11
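In an alignment algorithm, such a matrix takes the place of the function sub(a, b). The Python sketch below shows the lookup on a small excerpt: the four entries are copied from the table above, while the names `BLOSUM62_EXCERPT`, `sub`, and `ungapped_score` are hypothetical. A real program would load all 210 pairs.

```python
# Four entries copied from the table above (an excerpt, not the full matrix).
BLOSUM62_EXCERPT = {
    ("W", "W"): 11,
    ("C", "C"): 9,
    ("F", "Y"): 3,
    ("C", "W"): -2,
}


def sub(a, b, matrix=BLOSUM62_EXCERPT):
    """Score for aligning residues a and b; the matrix is symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]   # raises KeyError if the pair is not in the excerpt


def ungapped_score(x, y):
    """Sum of substitution scores for two equal-length strings, no gaps."""
    if len(x) != len(y):
        raise ValueError("strings must have the same length")
    return sum(sub(a, b) for a, b in zip(x, y))


print(ungapped_score("WCF", "WCY"))   # 11 + 9 + 3
```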

Hamming distance

Let x and y be two strings of lengths m and n, respectively, and let k be the maximum number of allowed errors (maximum Hamming distance).

The computation of the cells of the DP matrix D[0 . . m][0 . . n] is described by the following formula:

\[
D[i][j] =
\begin{cases}
k + 1 & 0 < i \le m,\ j = 0 \\
0 & 0 \le j \le n,\ i = 0 \\
D[i-1][j-1] & x[i] = y[j],\ 0 < i \le m,\ 0 < j \le n \\
D[i-1][j-1] + 1 & x[i] \ne y[j],\ 0 < i \le m,\ 0 < j \le n
\end{cases}
\]
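A minimal Python sketch of this recurrence follows; the function name `hamming_dp` and the short input strings in the usage lines are assumptions chosen for illustration.

```python
def hamming_dp(x, y, k):
    """Fill the DP matrix D for matching x in y under Hamming distance."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 is all zeros
    for i in range(1, m + 1):
        D[i][0] = k + 1                          # boundary value above k
        for j in range(1, n + 1):
            # Only the diagonal predecessor is used: no insertions or deletions.
            D[i][j] = D[i - 1][j - 1] + (x[i - 1] != y[j - 1])
    return D


D = hamming_dp("AB", "ABAA", k=1)
for row in D:
    print(*row)
```

For j ≥ i, the cell D[i][j] holds the number of mismatches between the first i characters of x and the i characters of y ending at position j. For j < i the diagonal runs into the first column, so the value is at least k + 1 and can never be reported as a match.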

Hamming distance - Example

Let x = ADBBCA, y = ADCABCAABADBBCA, and k = 3.

D  -  A  D  C  A  B  C  A  A  B  A  D  B  B  C  A
-  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
A  4  0  1  1  0  1  1  0  0  1  0  1  1  1  1  0
D  4  5  0  2  2  1  2  2  1  1  2  0  2  2  2  2
B  4  5  6  1  3  2  2  3  3  1  2  3  0  2  3  3
B  4  5  6  7  2  3  3  3  4  3  2  3  3  0  3  4
C  4  5  6  6  8  3  3  4  4  5  4  3  4  4  0  4
A  4  4  6  7  6  9  4  3  4  5  5  5  4  5  5  0

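The last row of D contains two values that do not exceed k = 3, namely D[6][7] = 3 and D[6][15] = 0. They correspond to the following two occurrences of x in y:

D C A B C A
. . . | | |
A D B B C A

A D B B C A
| | | | | |
A D B B C A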
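The two occurrences can be cross-checked without the DP matrix by comparing x against every window of y directly. The Python sketch below does this; the function name `occurrences_within_k` and the choice to report 1-based end positions (matching the column index j of the table) are assumptions of this example.

```python
def occurrences_within_k(x, y, k):
    """Return (end_position, mismatches) for each window of y within k of x.

    End positions are 1-based, matching the column index j of the matrix D.
    """
    m = len(x)
    hits = []
    for end in range(m, len(y) + 1):
        window = y[end - m:end]
        # Hamming distance: count positions where the characters differ.
        dist = sum(a != b for a, b in zip(x, window))
        if dist <= k:
            hits.append((end, dist))
    return hits


print(occurrences_within_k("ADBBCA", "ADCABCAABADBBCA", k=3))
```

Each reported pair (j, d) agrees with the last row of D, where D[m][j] = d for j ≥ m.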

Overview

Pairwise sequence alignment is the process of comparing two strings of letters to infer their similarity.

There exist two main distances for comparing two strings: the edit distance and the Hamming distance.

A different formulation of the edit distance is to maximize the similarity of the two strings, known as global alignment (Needleman-Wunsch algorithm), instead of minimizing the distance between the two strings.

Instead of considering a global alignment between two strings, in biological applications it is often more relevant to determine a best alignment between substrings of the two strings, known as local alignment (Smith-Waterman algorithm).

The Hamming distance is a particular case of the edit distance in which only the substitution operation is considered.

N. Alachiotis, S. Berger, and A. Stamatakis. Coupling SIMD and SIMT architectures to boost performance of a phylogeny-aware alignment kernel. BMC Bioinformatics, 13:196, 2012.

F. J. Damerau. A technique for computer detection and correction of spelling errors. Commun. ACM, 7(3):171–176, 1964.

M. Farrar. Striped Smith–Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23(2):156–161, 2007.

O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3):705–708, 1982.

S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89(22):10915–10919, 1992.

D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18(6):341–343, 1975.

V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Technical Report 8, Soviet Physics Doklady, 1966.

W. Liu, B. Schmidt, G. Voss, and W. Muller-Wittig. Streaming Algorithms for Biological Sequence Alignment on GPUs. IEEE Trans. Parallel Distrib. Syst., 18(9):1270–1281, 2007.

S. Manavski and G. Valle. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9(S-2), 2008.

E. W. Myers and W. Miller. Optimal alignments in linear space. Computer Applications in the Biosciences, 4(1):11–17, 1988.

S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

T. Rognes. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics, 12:221, 2011.

T. Rognes and E. Seeberg. Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics, 16(8):699–706, 2000.

D. Sankoff. Matching Sequences under Deletion/Insertion Constraints. Proceedings of the National Academy of Sciences of the United States of America, 69(1):4–6, 1972.

P. H. Sellers. On the theory and computation of evolutionary distances. SIAM Journal on Applied Mathematics, 26(4):787–793, 1974.

T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

T. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics, 4:52–57, 1968.