
Pairwise sequence alignment

Solon P. Pissis Tomas Flouri

Heidelberg Institute for Theoretical Studies

November 17, 2012

Contents

1 Introduction

2 Basic definitions
  Alphabet and strings
  Distance metrics between strings
  Alignment

3 Alignment algorithms on strings
  Edit distance
  Global alignment
  Local alignment
  Substitution matrices
  Hamming distance

4 Conclusion
  Overview

Introduction

Sequence alignment is the process of comparing two or more strings of letters (e.g. nucleotides or amino acids) to infer their similarity.

Pairwise sequence alignment is the process of comparing only two strings.

It is useful in dozens of biological applications, e.g. genome assembly: taking a huge number of DNA sequences and putting them back together to create a representation of the original chromosomes from which the DNA originated.

    1 2 3 4 5 6 7 8 9
x = G C G A C G T C C
    | | | |     | . |
y = G C G A - - T A C

Figure: Alignment between x = GCGACGTCC and y = GCGATAC: one mismatch at position 8 and a gap of length two inserted in y after position 4.

We focus on online sequence alignment: the sequences cannot be preprocessed to build an index on them.

There exist four main approaches to online sequence alignment: algorithms based on dynamic programming (DP); algorithms based on automata; algorithms based on word-level parallelism; and algorithms based on filtering. We focus on algorithms based on dynamic programming.

Two distances are mainly used for comparing two strings: the edit distance (Levenshtein distance) and the Hamming distance.

Biological applications require modifying the algorithms that measure the distance between two strings in order to perform two main types of sequence alignment, local and global, between nucleotide (or protein) sequences.

      1  2  3  4  5  6  7  8  9 10 11
x =   C  G  T  C  C  G  A  A  G  T  G
            |  .  |  |  |  |
y =   -  -  T  A  C  G  A  A  -  -  -

Table: Global alignment between x = CGTCCGAAGTG and y = TACGAA

      3  4  5  6  7  8
x =   T  C  C  G  A  A
      |  .  |  |  |  |
y =   T  A  C  G  A  A

Table: Local alignment between x = CGTCCGAAGTG and y = TACGAA

Alphabet and strings

Definition (Alphabet)

An alphabet Σ is a finite non-empty set whose elements are called letters.

Definition (String)

A string on an alphabet Σ is a finite, possibly empty, sequence of elements of Σ. The zero-letter sequence is called the empty string, and is denoted by ε. The set of all possible strings on the alphabet Σ is denoted by Σ*.

Definition (Length of string)

The length of a string x is defined as the length of the sequence associated with the string x, and is denoted by |x|.

We denote by x[i], for all 0 ≤ i < |x|, the letter at index i of x. We also call index i, for all 0 ≤ i < |x|, a position in x when x ≠ ε. It follows that the i-th letter of x is the letter at position i − 1 in x, and that

x = x[0 . . |x| − 1]

Definition (Identity between strings)

The identity between any two strings x and y is defined as

x = y if and only if |x| = |y| and x[i] = y[i], for all 0 ≤ i < |x|

Definition (Concatenation of strings)

The concatenation of two strings x and y is the string of the letters of x followed by the letters of y. It is denoted by xy.

Definition (Factor of string)

A string x is a factor (substring) of a string y if there exist two strings u and v such that y = uxv.

Notice that u and v are possibly empty strings!

Definition (Occurrence of string)

Let x be a non-empty string and y be a string. We say that there exists an occurrence of x in y, or, more simply, that x occurs in y, when x is a factor of y.

Distance metrics between strings

Distance

Definition (Distance between two strings)

We say that a function δ : Σ* × Σ* → R is a distance on Σ* if the four following properties are satisfied, for every u, v ∈ Σ*:

Positivity: δ(u, v) ≥ 0

Separation: δ(u, v) = 0 if and only if u = v

Symmetry: δ(u, v) = δ(v, u)

Triangle inequality: δ(u, v) ≤ δ(u, w) + δ(w, v), for every w ∈ Σ*

The distances are defined from operations that transform x into y. Three types of elementary operations are considered:

substitution (sub) of a letter of x at a given position by a letter of y

deletion (del) of a letter of x at a given position

insertion (ins) of a letter of y in x at a given position

Edit distance

We implicitly assume that the costs of edit operations are independent of the positions at which the operations are realized, and that sub(a, b) := sub(b, a) := del(a) := ins(b) := 1, for a, b ∈ Σ, a ≠ b.

Definition (Edit distance)

From the elementary costs, we set

δE(x, y) = min{cost of σ : σ ∈ S(x, y)}

where S(x, y) is the set of sequences of elementary edit operations that transform x into y, and the cost of an element σ ∈ S(x, y) is the sum of the costs of the edit operations of the sequence σ. The function δE is then a distance on Σ*, and it is called the edit distance (Levenshtein distance).

Hamming distance

Definition (Hamming distance)

The Hamming distance, denoted by δH, is defined for two strings of the same length as the number of positions in which the two strings possess different letters. The Hamming distance is a particular case of edit distance for which only the operation of substitution is considered. This amounts to setting del(a) = ins(a) = +∞, for each a ∈ Σ.
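The definition translates almost literally into code. The following is a minimal Python sketch; the function name `hamming_distance` is an illustrative choice and not part of the slides.

```python
def hamming_distance(x: str, y: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(x) != len(y):
        # The Hamming distance is only defined for strings of the same length.
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(x, y))
```

For instance, the two factors TCCGAA and TACGAA from the local alignment shown in the introduction differ in exactly one position.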

Alignment

Definition (Alignment between two strings)

An alignment between x and y is a string z on the alphabet of pairs of letters, more accurately on

\[
(\Sigma \cup \{\varepsilon\}) \times (\Sigma \cup \{\varepsilon\}) \setminus \{(\varepsilon, \varepsilon)\}
\]

whose projection on the first component is x, and whose projection on the second component is y. Thus, if z is an alignment of length p between x and y, we have

\[
\begin{aligned}
z &= (x'_0, y'_0)(x'_1, y'_1) \cdots (x'_{p-1}, y'_{p-1}) \\
x &= x'_0 x'_1 \cdots x'_{p-1} \\
y &= y'_0 y'_1 \cdots y'_{p-1}
\end{aligned}
\]

where x'_i ∈ Σ ∪ {ε} and y'_i ∈ Σ ∪ {ε}, for all 0 ≤ i < p.

Alignment

Example

Let x = ACGA and y = ATGCTA. An alignment between x and y is

ACG--A
ATGCTA

Operation            Aligned pair   Cost
substitute A for A   (A,A)          0
substitute T for C   (C,T)          1
substitute G for G   (G,G)          0
insert C             (-,C)          1
insert T             (-,T)          1
substitute A for A   (A,A)          0

This alignment has cost 3. It is optimal, since the edit distance between the two strings is also 3.
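The cost of an alignment can be obtained by summing the cost of its aligned pairs. Below is a minimal Python sketch, assuming that gaps are written as "-" and that all insertions and deletions share one cost; the name `alignment_cost` and its parameters are illustrative, not part of the slides.

```python
GAP = "-"

def alignment_cost(row_x: str, row_y: str, sub: int = 1, indel: int = 1) -> int:
    """Sum the costs of the aligned pairs of two gapped rows of equal length."""
    if len(row_x) != len(row_y):
        raise ValueError("both rows of an alignment must have the same length")
    cost = 0
    for a, b in zip(row_x, row_y):
        if a == GAP and b == GAP:
            # The pair (ε, ε) is excluded from the alignment alphabet.
            raise ValueError("a column cannot consist of two gaps")
        if a == GAP or b == GAP:
            cost += indel      # insertion or deletion
        elif a != b:
            cost += sub        # substitution by a different letter
        # identical letters cost 0
    return cost
```

Applied to the rows ACG--A and ATGCTA with unit costs, the function sums one substitution and two insertions.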

Edit distance

We focus on algorithms based on dynamic programming (DP). Let x and y be two strings of lengths m and n, respectively. The cells of the DP matrix T[0 . . m][0 . . n] can be computed by the following formula, where T[i][j] is the edit distance between the prefixes x[0 . . i − 1] and y[0 . . j − 1], and sub(a, a) = 0. Strings are 0-indexed, so row i corresponds to the letter x[i − 1] and column j to the letter y[j − 1].

\[
T[i][j] =
\begin{cases}
0 & i = j = 0 \\
T[i-1][0] + \mathrm{del}(x[i-1]) & 0 < i \le m,\ j = 0 \\
T[0][j-1] + \mathrm{ins}(y[j-1]) & i = 0,\ 0 < j \le n \\
\min \left\{ \begin{array}{l}
T[i-1][j-1] + \mathrm{sub}(x[i-1], y[j-1]) \\
T[i-1][j] + \mathrm{del}(x[i-1]) \\
T[i][j-1] + \mathrm{ins}(y[j-1])
\end{array} \right. & 0 < i \le m,\ 0 < j \le n
\end{cases}
\]
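The recurrence can be sketched in Python as follows. This is an illustrative implementation under the assumption of position-independent costs; the function name and the parameter names (`sub`, `ins`, `dele`) are choices made here, not taken from the slides. Because Python strings are 0-indexed, cell (i, j) compares x[i - 1] with y[j - 1].

```python
def edit_distance_table(x: str, y: str, sub: int = 1, ins: int = 1, dele: int = 1):
    """Fill the (m+1) x (n+1) DP table T of edit distances between prefixes."""
    m, n = len(x), len(y)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: delete the first i letters of x
        T[i][0] = T[i - 1][0] + dele
    for j in range(1, n + 1):          # first row: insert the first j letters of y
        T[0][j] = T[0][j - 1] + ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = 0 if x[i - 1] == y[j - 1] else sub
            T[i][j] = min(T[i - 1][j - 1] + s,    # substitution or match
                          T[i - 1][j] + dele,     # deletion of x[i-1]
                          T[i][j - 1] + ins)      # insertion of y[j-1]
    return T
```

The edit distance itself is the bottom-right cell, `edit_distance_table(x, y)[-1][-1]`.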

Edit distance - Example 1

Let x = EAWACQGKL, y = ERDAWCQPGKWY, sub(a, b) := 3, ins(a) := 1, and del(a) := 1, where a, b ∈ Σ, a ≠ b, x, y ∈ Σ*, and Σ is the amino acids alphabet.

T    -  E  R  D  A  W  C  Q  P  G  K  W  Y
-    0  1  2  3  4  5  6  7  8  9 10 11 12
E    1  0  1  2  3  4  5  6  7  8  9 10 11
A    2  1  2  3  2  3  4  5  6  7  8  9 10
W    3  2  3  4  3  2  3  4  5  6  7  8  9
A    4  3  4  5  4  3  4  5  6  7  8  9 10
C    5  4  5  6  5  4  3  4  5  6  7  8  9
Q    6  5  6  7  6  5  4  3  4  5  6  7  8
G    7  6  7  8  7  6  5  4  5  4  5  6  7
K    8  7  8  9  8  7  6  5  6  5  4  5  6
L    9  8  9 10  9  8  7  6  7  6  5  6  7

The edit distance is T[9][12] = 7. Optimal alignments include:

E--AWACQ-GK--L
ERDAW-CQPGKWY-

E--AWACQ-GK-L-
ERDAW-CQPGKW-Y

E--AWACQ-GKL--
ERDAW-CQPGK-WY

Edit distance - Example 2

Let x = ACGA, y = ATGCTA, sub(a, b) := 1, ins(a) := 1, and del(a) := 1, where a, b ∈ Σ, a ≠ b, x, y ∈ Σ*, and Σ is the DNA alphabet.

T   -  A  T  G  C  T  A
-   0  1  2  3  4  5  6
A   1  0  1  2  3  4  5
C   2  1  1  2  2  3  4
G   3  2  2  1  2  3  4
A   4  3  3  2  2  3  3

The edit distance is T[4][6] = 3. Optimal alignments include:

A--CGA
ATGCTA

ACG--A
ATGCTA
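An optimal alignment can be read off the table by walking back from the bottom-right cell to the top-left cell. The sketch below is one possible way to do this in Python, assuming position-independent costs; the function name `align` and the tie-breaking order (diagonal first, then up, then left) are arbitrary choices made for illustration, so other optimal alignments may exist.

```python
def align(x: str, y: str, sub: int = 1, ins: int = 1, dele: int = 1):
    """Return (gapped x, gapped y, cost) for one optimal global alignment."""
    m, n = len(x), len(y)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        T[i][0] = T[i - 1][0] + dele
    for j in range(1, n + 1):
        T[0][j] = T[0][j - 1] + ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = 0 if x[i - 1] == y[j - 1] else sub
            T[i][j] = min(T[i - 1][j - 1] + s, T[i - 1][j] + dele, T[i][j - 1] + ins)

    # Traceback: at each cell, step to a neighbour that explains its value.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = 0 if x[i - 1] == y[j - 1] else sub
            if T[i][j] == T[i - 1][j - 1] + s:      # match or substitution
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and T[i][j] == T[i - 1][j] + dele:  # deletion of x[i-1]
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:                                        # insertion of y[j-1]
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay)), T[m][n]
```

With this tie-breaking order, `align("ACGA", "ATGCTA")` yields the alignment A--CGA / ATGCTA with cost 3.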

Edit distance - Complexities

The first algorithm for solving this problem has been rediscovered many times in different fields (Vintsyuk, 1968; Needleman and Wunsch, 1970; Sankoff, 1972; Sellers, 1974; etc.).

The value of each cell of the table T depends only on its three neighbour cells, so it is computed in O(1) time.

For the DP matrix T[0 . . m][0 . . n], there are m × n values computed in this way.

The initialization phase requires time O(m + n).

Hence, table T can be computed in O(m × n) time and space.

If we are only interested in the edit distance between the strings (but not the alignment!), it suffices to keep two columns (or two rows) of the table at a time: the distance can then be computed in O(m × n) time and O(min(n, m)) space.
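The space saving can be sketched as follows: only the previous and the current row are kept, and the shorter string is placed along the row so that the rows have length min(m, n) + 1. This is an illustrative Python sketch with hypothetical names; swapping the two strings also swaps the roles of insertion and deletion, which the code accounts for.

```python
def edit_distance(x: str, y: str, sub: int = 1, ins: int = 1, dele: int = 1) -> int:
    """Edit distance in O(m * n) time and O(min(m, n)) space (no alignment)."""
    if len(y) > len(x):
        # Keep the shorter string along the row; transforming y into x
        # exchanges the roles of insertions and deletions.
        x, y = y, x
        ins, dele = dele, ins
    prev = [j * ins for j in range(len(y) + 1)]      # row i = 0
    for i in range(1, len(x) + 1):
        curr = [prev[0] + dele] + [0] * len(y)       # first cell of row i
        for j in range(1, len(y) + 1):
            s = 0 if x[i - 1] == y[j - 1] else sub
            curr[j] = min(prev[j - 1] + s,           # diagonal
                          prev[j] + dele,            # up
                          curr[j - 1] + ins)         # left
        prev = curr
    return prev[-1]
```

The traceback is no longer possible with this variant, because the earlier rows have been discarded.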

Global alignment

Needleman-Wunsch algorithm & Global alignment

Needleman and Wunsch re-formulated the edit distance problem (Damerau, 1964; Levenshtein, 1966) in terms of maximizing similarity (Needleman and Wunsch, 1970).

Sellers later showed that the two problems are equivalent (Sellers, 1974).

The notion of distance between two strings is not well suited to biological applications.

We rather utilize a notion of similarity between strings, for which dissimilarities are penalized and similarities are favored; i.e. sub(a, a) > 0, sub(a, b) < 0, ins(a) < 0, del(a) < 0 for a, b ∈ Σ, a ≠ b. An optimal alignment is then one that maximizes the total score, so the minimum in the edit distance recurrence becomes a maximum.

This is known as the Needleman-Wunsch algorithm for global sequence alignment.
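A minimal Python sketch of the similarity formulation is given below. The concrete scores (match = 1, mismatch = -1, gap = -1) are assumptions chosen only to satisfy the sign constraints above; they do not come from the slides, and practical tools typically use a substitution matrix instead.

```python
def needleman_wunsch_score(x: str, y: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Score of an optimal global alignment under a simple similarity scheme."""
    m, n = len(x), len(y)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # aligning a prefix of x against nothing
        T[i][0] = T[i - 1][0] + gap
    for j in range(1, n + 1):          # aligning a prefix of y against nothing
        T[0][j] = T[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + s,   # aligned pair
                          T[i - 1][j] + gap,     # gap in y
                          T[i][j - 1] + gap)     # gap in x
    return T[m][n]
```

The only differences from the edit distance computation are the sign of the scores and the use of `max` in place of `min`.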

Local alignment

Smith-Waterman algorithm & Local alignment

Instead of considering a global alignment between x and y, in molecular biology it is often more relevant to determine a best alignment between a substring of x and a substring of y.

Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.

As for global alignment, we utilize a notion of similarity between strings, for which dissimilarity is penalized and similarity is favored, rather than a distance.

The search for a similar substring then consists in maximizing the similarity between substrings of the two strings.

This is known as the Smith-Waterman algorithm for local sequence alignment.

Let x and y be two strings of lengths m and n, respectively. The cells of the DP matrix S[0 . . m][0 . . n] are computed by the following formula (strings are 0-indexed, so row i corresponds to the letter x[i − 1] and column j to the letter y[j − 1])

\[
S[i][j] =
\begin{cases}
0 & i = j = 0 \\
0 & 0 < i \le m,\ j = 0 \\
0 & i = 0,\ 0 < j \le n \\
\max \left\{ \begin{array}{l}
0 \\
S[i-1][j-1] + \mathrm{sub}(x[i-1], y[j-1]) \\
S[i-1][j] + \mathrm{del}(x[i-1]) \\
S[i][j-1] + \mathrm{ins}(y[j-1])
\end{array} \right. & 0 < i \le m,\ 0 < j \le n
\end{cases}
\]

Recall the formula for the global alignment! With similarity scores it reads

\[
T[i][j] =
\begin{cases}
0 & i = j = 0 \\
T[i-1][0] + \mathrm{del}(x[i-1]) & 0 < i \le m,\ j = 0 \\
T[0][j-1] + \mathrm{ins}(y[j-1]) & i = 0,\ 0 < j \le n \\
\max \left\{ \begin{array}{l}
T[i-1][j-1] + \mathrm{sub}(x[i-1], y[j-1]) \\
T[i-1][j] + \mathrm{del}(x[i-1]) \\
T[i][j-1] + \mathrm{ins}(y[j-1])
\end{array} \right. & 0 < i \le m,\ 0 < j \le n
\end{cases}
\]

The local formula differs in two ways: the first row and the first column are initialized to 0, and no cell is allowed to drop below 0.

Smith-Waterman algorithm - Example

Let x = EAWACQGKL, y = ERDAWCQPGKWY, sub(a, a) := 1, sub(a, b) := −3, ins(a) := del(a) := −1, where a, b ∈ Σ, a ≠ b, x, y ∈ Σ*, and Σ is the amino acids alphabet.

S   - E R D A W C Q P G K W Y
-   0 0 0 0 0 0 0 0 0 0 0 0 0
E   0 1 0 0 0 0 0 0 0 0 0 0 0
A   0 0 0 0 1 0 0 0 0 0 0 0 0
W   0 0 0 0 0 2 1 0 0 0 0 1 0
A   0 0 0 0 1 1 0 0 0 0 0 0 0
C   0 0 0 0 0 0 2 1 0 0 0 0 0
Q   0 0 0 0 0 0 1 3 2 1 0 0 0
G   0 0 0 0 0 0 0 2 1 3 2 1 0
K   0 0 0 0 0 0 0 1 0 2 4 3 2
L   0 0 0 0 0 0 0 0 0 1 3 2 1

1 Locate one of the largest values in table S.

2 Trace back the path from this cell: at each step, move to the neighbor cell (diagonal, up, or left) from which the current value was obtained.

3 Stop the scan on a zero value.

The largest value is S[8][10] = 4, and the traceback yields the local alignment

AWACQ-GK
AW-CQPGK
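The table filling and the three traceback steps can be sketched together in Python. The function name `smith_waterman`, its default scores (taken from the example above), and the tie-breaking order in the traceback are illustrative choices; this is a sketch of the basic quadratic-space method, not an optimized implementation.

```python
def smith_waterman(x: str, y: str, match: int = 1, mismatch: int = -3, gap: int = -1):
    """Return (best score, gapped factor of x, gapped factor of y)."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + s,   # aligned pair
                          S[i - 1][j] + gap,     # gap in y
                          S[i][j - 1] + gap)     # gap in x
            if S[i][j] > best:                   # step 1: remember a largest cell
                best, bi, bj = S[i][j], i, j

    # Steps 2 and 3: trace back from the best cell until a zero is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and S[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))
```

For the example strings, `smith_waterman("EAWACQGKL", "ERDAWCQPGKWY")` returns the score 4 together with the rows AWACQ-GK and AW-CQPGK.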

Smith-Waterman algorithm - Complexities

A naive implementation of the Smith-Waterman algorithm to compute an optimal alignment requires O(m² × n) time and O(m × n) space (Smith and Waterman, 1981).

An improved version of this algorithm requires time O(m × n) (Gotoh, 1982).

An improved version of Gotoh's algorithm requires space O(max(m, n)) (Myers and Miller, 1988).

This was inspired by Hirschberg's linear-space algorithm for computing longest common subsequences (Hirschberg, 1975); see the Unix command diff for an application of longest common subsequences.

The Hirschberg algorithm is based on the divide and conquer principle: it divides the DP matrix into smaller parts and solves each of these parts separately. The same idea can be directly applied to DP-based global alignment algorithms!

Substitution matrices

A substitution matrix describes the rate at which one character in a sequence changes to other character states over time.

Usually seen in the context of amino acid or DNA sequence alignments.

The similarity between sequences depends on their divergence time and the substitution rates as represented in the matrix.

For example, in the process of evolution, from one generation to the next the amino acid sequences of an organism's proteins are gradually altered through the action of DNA mutations.

The BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of protein sequences.

Several sets of BLOSUM matrices exist using different alignment databases, named with numbers.

BLOSUM matrices were first introduced by Henikoff and Henikoff (Henikoff and Henikoff, 1992).

All BLOSUM matrices are based on observed (empirical) local alignments.

BLOSUM matrices with high numbers, e.g. BLOSUM80, are designed for comparing closely related sequences, while those with low numbers, e.g. BLOSUM45, are designed for comparing distantly related sequences.

Substitution matrices – BLOSUM

The BLOCKS database is a database containing multiple alignments of conserved regions in protein families.

Henikoff and Henikoff scanned the BLOCKS database for very conserved regions of protein families that do not have gaps in the sequence alignment.

Then they counted the relative frequencies of amino acids and their substitution probabilities.

Finally, they calculated a log-odds score for each of the 210 possible pairs (20 identities and 190 substitutions) of the 20 standard amino acids.

Scores within a BLOSUM matrix are log-odds scores that measure, in an alignment, the logarithm of the ratio between the likelihood of two amino acids being aligned for a biological reason and the likelihood of the same amino acids being aligned by chance.

Every possible match or substitution is assigned a score based on its observed frequencies in the alignment of related proteins.

A positive score is given to the more likely substitutions, while a negative score is given to the less likely substitutions.
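The log-odds idea can be illustrated with a few lines of Python. The sketch below assumes the construction described by Henikoff and Henikoff (1992): the expected frequency of a pair is q_a · q_a for identical letters and 2 · q_a · q_b for different letters, and scores are expressed in half-bit units (scale factor 2) and rounded to integers. The function names and the frequencies used in the usage note are hypothetical and serve only as an illustration.

```python
import math

def expected_pair_frequency(q_a: float, q_b: float, same: bool) -> float:
    """Chance frequency of an unordered pair given background frequencies."""
    return q_a * q_b if same else 2.0 * q_a * q_b

def log_odds_score(observed: float, expected: float, scale: float = 2.0) -> int:
    """Rounded log-odds score: scale * log2(observed / expected)."""
    return round(scale * math.log2(observed / expected))
```

With hypothetical numbers, a pair observed twice as often as expected by chance scores `log_odds_score(0.02, 0.01)`, which is positive, while a pair observed half as often as expected scores `log_odds_score(0.005, 0.01)`, which is negative.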

Substitution matrices – BLOSUM62

     C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C    9  -1  -1  -3   0  -3  -3  -3  -4  -3  -3  -3  -3  -1  -1  -1  -1  -2  -2  -2
S   -1   4   1  -1   1   0   1   0   0   0  -1  -1   0  -1  -2  -2  -2  -2  -2  -3
T   -1   1   4   1  -1   1   0   1   0   0   0  -1   0  -1  -2  -2  -2  -2  -2  -3
P   -3  -1   1   7  -1  -2  -1  -1  -1  -1  -2  -2  -1  -2  -3  -3  -2  -4  -3  -4
A    0   1  -1  -1   4   0  -1  -2  -1  -1  -2  -1  -1  -1  -1  -1  -2  -2  -2  -3
G   -3   0   1  -2   0   6  -2  -1  -2  -2  -2  -2  -2  -3  -4  -4   0  -3  -3  -2
N   -3   1   0  -2  -2   0   6   1   0   0  -1   0   0  -2  -3  -3  -3  -3  -2  -4
D   -3   0   1  -1  -2  -1   1   6   2   0  -1  -2  -1  -3  -3  -4  -3  -3  -3  -4
E   -4   0   0  -1  -1  -2   0   2   5   2   0   0   1  -2  -3  -3  -3  -3  -2  -3
Q   -3   0   0  -1  -1  -2   0   0   2   5   0   1   1   0  -3  -2  -2  -3  -1  -2
H   -3  -1   0  -2  -2  -2   1   1   0   0   8   0  -1  -2  -3  -3  -2  -1   2  -2
R   -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5   2  -1  -3  -2  -3  -3  -2  -3
K   -3   0   0  -1  -1  -2   0  -1   1   1  -1   2   5  -1  -3  -2  -3  -3  -2  -3
M   -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5   1   2  -2   0  -1  -1
I   -1  -2  -2  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4   2   1   0  -1  -3
L   -1  -2  -2  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4   3   0  -1  -2
V   -1  -2  -2  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4  -1  -1  -3
F   -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6   3   1
Y   -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7   2
W   -2  -3  -3  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11

Note: a BLOSUM matrix is symmetric, but the table as transcribed here is not everywhere (e.g. row G, column V is 0 while row V, column G is −3), so individual entries should be checked against the published matrix before use.

Hamming distance

Let x and y be two strings of lengths m and n, respectively, and k be the maximum number of allowed errors (maximum Hamming distance). The cells of the DP matrix D[0 . . m][0 . . n] are computed by the following formula (strings are 0-indexed, so row i corresponds to the letter x[i − 1] and column j to the letter y[j − 1])

\[
D[i][j] =
\begin{cases}
k + 1 & 0 < i \le m,\ j = 0 \\
0 & i = 0,\ 0 \le j \le n \\
D[i-1][j-1] & x[i-1] = y[j-1],\ 0 < i \le m,\ 0 < j \le n \\
D[i-1][j-1] + 1 & x[i-1] \ne y[j-1],\ 0 < i \le m,\ 0 < j \le n
\end{cases}
\]

For j ≥ m, cell D[m][j] holds the Hamming distance between x and the factor of y of length m ending at position j − 1. Every j with D[m][j] ≤ k therefore marks an occurrence of x in y with at most k mismatches.

Hamming distance - Example

Let x = ADBBCA, y = ADCABCAABADBBCA, and k = 3.

D   - A D C A B C A A B A D B B C A
-   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A   4 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0
D   4 5 0 2 2 1 2 2 1 1 2 0 2 2 2 2
B   4 5 6 1 3 2 2 3 3 1 2 3 0 2 3 3
B   4 5 6 7 2 3 3 3 4 3 2 3 3 0 3 4
C   4 5 6 6 8 3 3 4 4 5 4 3 4 4 0 4
A   4 4 6 7 6 9 4 3 4 5 5 5 4 5 5 0

Two cells of the last row are at most k: D[6][7] = 3 and D[6][15] = 0. They correspond to the occurrences

D C A B C A
. . . | | |
A D B B C A

A D B B C A
| | | | | |
A D B B C A
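The matching procedure can be sketched in Python as follows. The function name `k_mismatch_occurrences` and the choice to report 0-based starting positions are assumptions made for illustration; the table is filled exactly as in the formula for D, with the first column set to k + 1 so that factors shorter than x are never reported.

```python
def k_mismatch_occurrences(x: str, y: str, k: int):
    """Return (start, mismatches) for every factor of y of length |x|
    whose Hamming distance to x is at most k; starts are 0-based."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]     # row 0 is all zeros
    for i in range(1, m + 1):
        D[i][0] = k + 1                           # no room for i letters yet
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Only the diagonal is used: substitutions are the sole operation.
            D[i][j] = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else 1)
    return [(j - m, D[m][j]) for j in range(m, n + 1) if D[m][j] <= k]
```

For x = ADBBCA, y = ADCABCAABADBBCA, and k = 3, the function reports the factor starting at position 1 with three mismatches and the exact occurrence starting at position 9.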

Overview

Pairwise sequence alignment is the process of comparing two strings of letters to infer their similarity.

There exist two main distances for comparing two strings: the edit distance and the Hamming distance.

A different formulation of the edit distance is to maximize the similarity of the two strings, instead of minimizing the distance between them: global alignment (Needleman-Wunsch algorithm).

Instead of considering a global alignment between two strings, in biological applications it is often more relevant to determine a best alignment between substrings of the two strings: local alignment (Smith-Waterman algorithm).

The Hamming distance is a particular case of edit distance for which only the operation of substitution is considered.

References

N. Alachiotis, S. Berger, and A. Stamatakis. Coupling SIMD and SIMT architectures to boost performance of a phylogeny-aware alignment kernel. BMC Bioinformatics, 13:196, 2012.

F. J. Damerau. A technique for computer detection and correction of spelling errors. Commun. ACM, 7(3):171–176, 1964.

M. Farrar. Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23(2):156–161, 2007.

O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3):705–708, 1982.

S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89(22):10915–10919, 1992.

D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18(6):341–343, 1975.

V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710, 1966.

W. Liu, B. Schmidt, G. Voss, and W. Müller-Wittig. Streaming algorithms for biological sequence alignment on GPUs. IEEE Trans. Parallel Distrib. Syst., 18(9):1270–1281, 2007.

S. Manavski and G. Valle. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9(S-2), 2008.

E. W. Myers and W. Miller. Optimal alignments in linear space. Computer Applications in the Biosciences, 4(1):11–17, 1988.

S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

T. Rognes. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics, 12:221, 2011.

T. Rognes and E. Seeberg. Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics, 16(8):699–706, 2000.

D. Sankoff. Matching sequences under deletion/insertion constraints. Proceedings of the National Academy of Sciences of the United States of America, 69(1):4–6, 1972.

P. H. Sellers. On the theory and computation of evolutionary distances. SIAM Journal on Applied Mathematics, 26(4):787–793, 1974.

T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

T. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics, 4:52–57, 1968.
