biometrics module code: ca641 week 11- pairwise sequence alignment

20
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

Upload: briana-dorsey

Post on 11-Jan-2016

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

BIOMETRICS

Module Code: CA641

Week 11- Pairwise Sequence Alignment

Page 2: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

Outline

1. Why should one do sequence comparison?

2. Dynamic programming alignment algorithms

1. global alignment (Needleman-Wunsch)

2. local alignment (Smith-Waterman)

3. Heuristic alignment algorithms: FASTA, BLAST

Acknowledgement: this presentation was created following Prof. Liviu Ciortuz, “Computer Science” Faculty, “Al. I. Cuza” University, Iasi, Romania: http://profs.info.uaic.ro/~ciortuz/

2

Page 3: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

3

Why should one do sequence comparison?First success stories

• 1984 (Russell Doolittle): similarity between the newly discovered ν-sis onco-gene in monkeys and a previously known gene PDGF (platelet derived growth factor) involved in controlled growth of the cell, suggested that the onco-gene does the right job at the wrong time

• 1989: cystic fibrosis disease — an abnormally high content of sodium in the mucus in lungs:

• Biologists suggested that this is due to a mutant gene (still unknown at that time, but narrowed down to a 1MB region) on chromosome 7. Comparison with (then) known genes found similarity with a gene that encodes ATP , the adenosine triphosphate binding protein, which spans the cell membrane as an ion transport channel. Cystic fibrosis was then proved to be caused by the deletion of a triplet in the CFTR (cystic fibrosis transport regulator) gene.

Page 4: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

Model organism

• a non-human species - extensively studied to understand particular biological phenomena

• used to research human disease

4

Escherichia coli Drosophila melanogaster

Zea mays Laboratory mice

Image source: http://en.wikipedia.org/wiki/Model_organism

Page 5: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

5

How to rank alignments: Scoring systems Notations

• Consider 2 sequences x and y of lengths n and m , respectively; xi is the i-th symbol (“residue”) in x and yj is the j -th symbol in y

• For DNA, the alphabet is {A, C, G, T}. For proteins, the alphabet consists of 20 amino acids.

Page 6: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

Sequence matching/alignment • Given two DNA sequences or proteins

– X = x1x2x3..xm

– Y = y1y2y3…yn,

where xi, yi are from the alphabet Alf = {A, C, G, T}, what is their similarity?

• Seq. Alignment = a way to arrange the sequences of DNA, RNA or protein, i.e. regions of similarity can be identified

• Example of seq. alignment:

• Gaps may be opened to improve alignment

6

Sequences Alignment

ACCCGA ACCCGA

ACTA =>align AC - - TA

TCCTA TCC - TA

ACCGA ACCCGA

| | : : : : | | : : :

ACTA TCCTA

Page 7: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

7

A dot plot example

y\x H E A G A W G H E E

P

A • •

W •

H • •

E • • •

A • •

E • • •

Page 8: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

8

Gap Penalties• Types of gap penalties:– Linear score: ϒ (g) = -gd– Affine score: ϒ (g) = -d – (g-1)e

where:• d > 0 is the gap open penalty,• e> 0 is the gap extension penalty,• g N is the length of the gap.

• Note:• Usually e < d, therefore long insertions and deletions will be

penalised less than they would have been penalised by a linear gap cost.• Examples from this presentation use d = -8.

Page 9: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

9

DNA substitution matrices

A C G T

A 1 -1 -1 -1

C -1 1 -1 -1

G -1 -1 1 -1

T -1 -1 -1 1

Simple

A C G T

A 91 -114 -31 -123

C -114 100 -125 -31

G -31 -125 100 -114

T -123 -31 -114 91

Used in genome alignments

Page 10: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

10

Dynamic programming

• Better alignments will have a higher score. Therefore, the purpose is to maximize the score in order to find the optimal alignment.

• Sometimes scores are assigned by other means as costs or edit distances, when a minimum cost of an alignment is searched.

• Both approaches have been used in biological sequence comparison.

• Dynamic programming is applied to both cases: the difference is searching for a minimum or maximum score of the alignment.

Page 11: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

11

The score matrix for the following examples (from BLOSUM50)

Examples from this presentation use d = -8.

Page 12: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

12

Global Alignment:Needleman–Wunsch Algorithm (1970)

• Problem: Find the optimal global alignment of two sequences x = x1..xn, and y = y1..ym.

• Idea: Use dynamic programming.• Initialisation:

F(0,0) =0, F(i,0)=-id, F(0,j) = -jd, for i=1..n, j=1..m• Recurrence:

There are three ways in which the best score F(i,j) of an alignment up to xi, yj could be obtained:

F(i,j)= max

Fill in the matrix F from top left to bottom right.• Optimal score: F(n,m).

F(i-1, j-1) + score(xi, yj),

F(i-1, j) – d,

F(i, j-1)-d

Page 13: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

13

Filling in the dynamics programming matrix

F(i-1, j-1)Score(xi, yj)

F(i, j-1)

F(i-1, j) F(i,j)-d

-d

Page 14: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

14

Global Alignment: ExampleComputing the dynamic programming matrix

Page 15: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

15

Global Alignment: Example (cont’d)Finding an optimal alignment

Page 16: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

16

Local AlignmentSmith–Waterman Algorithm

• Problem: Find the best local alignment of two sequences x and y, allowing gaps.• Idea: Modify the global alignment algorithm in such a way that it will end in a

local alignment whenever the score becomes negative..• Initialisation:

F(0,0) =0, F(i,0)= F(0,j) = 0, for i=1..n, j=1..m• Recurrence:

There are three ways in which the best score F(i,j) of an alignment up to xi, yj could be obtained:

F(i,j)= max

Fill in the matrix F from top left to bottom right.

• Optimal score: maxi, j F(i,j).

0

F(i-1, j-1) + score(xi, yj),

F(i-1, j) – d,

F(i, j-1)-d

Page 17: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

17

Note

For the local alignment algorithm to work,

• Some (a,b) must be positive, otherwise the algorithm will not find any alignment at all;

• The expected score of a random alignment should be negative, otherwise

long matches between unrelated sequences will have the high score, just based on their length, and a true local alignment would be masked by a longer but incorrect one.

Page 18: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

18

Local Alignment: Example

Page 20: BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment

20

Biometric Workshop – page 11

• Q.3 Given 2 sequences

GCCGAGCTAG

CCTAGCAAG• Build the matrix of maximal similarity between them,

according to Smith-Waterman algorithm (local alignment), using the scores: match = 1; mismatch = -1; gap = -3.

• What is the optimal alignment?