sequence alignment - unibo.it · sequence alignment. sequence alignment ... (optimal global...

62
Sequence Alignment

Upload: phungdat

Post on 13-Apr-2018


Page 1: Sequence Alignment - unibo.it · Sequence Alignment. Sequence Alignment ... (optimal global alignment value) is maximum e.g. x = aaaacccccgggg y = cccgggaaccaacc. Why local alignment

Sequence Alignment

Sequence Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 || |||||||  |||| |  || |||  |||||
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Distance from sequences

Mutation events

Mutation: we need a score for substituting symbol i with symbol j (amino acid residues, nucleotides, etc.)

substitution matrices s(i,j)

A: ALASVLIRLITRLYP
B: ASAVHLNRLITRLYP

The substitution matrix should account for the underlying biological process (conserved structures or functions)

Substitution matrices

The basic idea is to measure the correlation between 2 sequences

Given a set of pairs of “correlated” (aligned) sequences, we measure the frequency P_ab of the substitution a -> b (assuming symmetry)

Under the null hypothesis (random correlation, i.e. independent events) the expected frequency is the product P_a P_b

We then define s(a,b) = log( P_ab / (P_a P_b) )

(the log makes the score additive: the product of probabilities over aligned pairs becomes a sum of scores)

Minimal distance = Maximum score (s)

Substitution matrices

Following this idea, several matrices have been derived. Their main difference lies in the type of alignments used for computing the frequencies

PAMx (Point Accepted Mutation): x is the percentage of mutation events. The matrix is built as:

A^1_ik = P(k|i) for sequences with 1% mutations.

A^n_ik = [(A^1)^n]_ik for n% mutation events (n counts the number of mutations, NOT the number of mutated residues).

E.g.: n=2: P(k|i) = sum_a P(a|i) P(k|a)

PAMn = log( A^n_ij / P_j )

PAM10

A very stringent matrix: no off-diagonal element is > 0

PAM160

Some positive values appear outside the diagonal

PAM250

This is one of the most widely used matrices

PAM500

Substitution matrices

PAM matrices are computed starting from very similar sequences; scoring matrices for more distant sequences are derived by matrix products

BLOSUMx: this family is computed directly from blocks of alignments with a defined sequence identity threshold (> x%).

=> For closely related sequences we must use PAM with low numbers and BLOSUM with high numbers. The opposite holds for distant sequences
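The derivation of PAMn from PAM1 by matrix products can be sketched as follows. This is only an illustration on a toy 3-letter alphabet: the transition values in `A1`, the uniform background frequencies `FREQS`, and the helper names are assumptions, and the log-odds are taken against the background frequency of the target residue.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(a, n):
    """Return a**n (n >= 1) by repeated multiplication."""
    result = a
    for _ in range(n - 1):
        result = mat_mul(result, a)
    return result

def pam_scores(a1, freqs, n):
    """Log-odds log(A^n[i][j] / freqs[j]), where A^n[i][j] = P(j | i) after n steps."""
    an = mat_power(a1, n)
    size = len(a1)
    return [[math.log(an[i][j] / freqs[j]) for j in range(size)]
            for i in range(size)]

# Toy 1% mutation matrix (hypothetical numbers; each row sums to 1).
A1 = [[0.98, 0.01, 0.01],
      [0.01, 0.98, 0.01],
      [0.01, 0.01, 0.98]]
FREQS = [1 / 3, 1 / 3, 1 / 3]

pam10 = pam_scores(A1, FREQS, 10)
pam250 = pam_scores(A1, FREQS, 250)
```

With these toy numbers the off-diagonal scores of `pam250` are less negative than those of `pam10`, which mirrors the trend described for the real matrices.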

BLOSUM62

Probably the most widely used

BLOSUM90

BLOSUM30

Exercise:

Write your own substitution matrix. Given a set of aligned sequences, compute:
P_ab = frequency of mutations between a -> b (assume symmetry: a -> b also counts as b -> a).
P_a = the marginal probability of P_ab.

Finally, derive the substitution matrix: s(a,b) = log( P_ab / (P_a P_b) )
Use the files provided at: http://www.biocomp.unibo.it/piero/P4B/Malignments/

Hints:
1. Start with toy.aln to check your algorithm
2. Initialize your P[a][b] matrix with pseudocounts (such as 1 instead of 0)
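One possible way to approach the exercise is sketched below. It assumes the aligned sequences are already loaded as equal-length strings (the format of the .aln files is not described here, so no parser is shown), counts every pair of sequences column by column, and uses a pseudocount of 1. The function name is illustrative.

```python
import math
from itertools import combinations

def substitution_matrix(aligned, alphabet, pseudocount=1.0):
    """Log-odds matrix s(a,b) = log(P_ab / (P_a * P_b)) from aligned sequences."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for s1, s2 in combinations(aligned, 2):
        for a, b in zip(s1, s2):
            if a in counts and b in counts:   # skip gaps and unknown symbols
                counts[a][b] += 1
                counts[b][a] += 1             # symmetry: a->b also counts as b->a
    total = sum(counts[a][b] for a in alphabet for b in alphabet)
    p_ab = {a: {b: counts[a][b] / total for b in alphabet} for a in alphabet}
    p_a = {a: sum(p_ab[a].values()) for a in alphabet}   # marginal of P_ab
    return {a: {b: math.log(p_ab[a][b] / (p_a[a] * p_a[b])) for b in alphabet}
            for a in alphabet}

# Toy alignment over a two-letter alphabet.
s = substitution_matrix(["AAB", "AAB", "ABB"], "AB")
```

In this toy case the conserved pair A/A gets a positive score and the substitution A/B a negative one.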

DotPlots:

DotPlot Exercise:

Write a program dotplot.py that takes as input a FASTA file with two sequences, a scoring matrix, a sliding-window size and a threshold cut-off, such as:

dotplot.py fasta.in score.mat 11 threshold

and prints the dotplot on the standard output.

PS: you can use matplotlib for a nicer presentation
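The core of such a program might look like the sketch below. File parsing and the command line are left out; the scoring function is passed in by the caller, an odd window size is assumed, and the identity scoring used in the example is only a stand-in for a real matrix.

```python
def dotplot(seq1, seq2, score, window, threshold):
    """Boolean matrix: cell (i, j) is True if the windowed score reaches the threshold."""
    half = window // 2
    rows = []
    for i in range(len(seq1)):
        row = []
        for j in range(len(seq2)):
            total = 0
            # Sum the scores along the diagonal inside the window, clipped at the ends.
            for k in range(-half, half + 1):
                if 0 <= i + k < len(seq1) and 0 <= j + k < len(seq2):
                    total += score(seq1[i + k], seq2[j + k])
            row.append(total >= threshold)
        rows.append(row)
    return rows

def render(dots):
    """Text rendering: '*' for a dot, '.' otherwise."""
    return "\n".join("".join("*" if d else "." for d in row) for row in dots)

def identity(a, b):
    """Stand-in scoring function: 1 for identical symbols, 0 otherwise."""
    return 1 if a == b else 0

picture = render(dotplot("ABCD", "ABCD", identity, 3, 2))
```

Printing `picture` shows the main diagonal, as expected when a sequence is compared with itself.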

Distance between sequences

What kinds of operations do we consider?

Mutation, Deletion and Insertion (rarely rearrangements)

A: ALASVLIRLIT--YP
B: ASAVHL---ITRLYP

The negative gap score depends only on the number n of gap positions:
linear: -n d
affine: -d - (n-1) e (d: gap open, e: gap extension)

!! Please note that all scores are position-independent along the sequence

Pairwise alignment

Given 2 sequences, what is an alignment with maximum score?

Naive solution: try every possible alignment and select one with the best score

Using the score function:

How many alignments are there?

Case without internal gaps

--AAA   -AAA   AAA   AAA   AAA-   AAA--
BB---   BB--   BB-   -BB   --BB   ---BB

Given 2 sequences of lengths m and n, the number of shifts is m + n + 1

How many alignments are there?

Case with internal gaps

--AAA   -AAA   -AAA   -AAA   A-AA
BB---   BB--   B-B-   B--B   BB--
BBAAA   BABAA  BAABA  BAAAB  ABBAA

AAA     AAA    AA-A   AAA    AAA-
BB-     B-B    -BB-   -BB    --BB   ...
ABABA   ABAAB  AABBA  AABAB  AAABB

The number of alignments equals the number of all possible ways of mixing the 2 sequences while preserving the original order within each sequence. Given 2 sequences of lengths n and m, this number is

sum_{k=0}^{min(m,n)} (m+n-k)! / ( k! (n-k)! (m-k)! )

For n = m = 80 we get > 10^43 possible alignments!
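The sum above is easy to evaluate exactly with integer arithmetic. The short sketch below (function name chosen for illustration) does so and can be used to check the order of magnitude quoted for n = m = 80.

```python
from math import factorial

def count_alignments(n, m):
    """Evaluate sum_{k=0}^{min(m,n)} (m+n-k)! / (k! (n-k)! (m-k)!) exactly."""
    return sum(
        factorial(m + n - k) // (factorial(k) * factorial(n - k) * factorial(m - k))
        for k in range(min(m, n) + 1)
    )

small = count_alignments(2, 2)      # 6 + 6 + 1
large = count_alignments(80, 80)
```

Each term is a multinomial coefficient, so the integer division is exact.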

Basic idea

We can keep a table of the precomputed substring alignments (dynamic programming)

ALSKLASPALSAKDLDSPALS
ALSKIADSLAPIKDLSPASLT

ALSKLASPALSAKDLDSPAL-S
ALSKIADSLAPIKDLSPASLT-

ALSKLASPALSAKDLDSPALS-
ALSKIADSLAPIKDLSPASL-T

Basic idea

Building step by step. Given the 2 sequences

ALSKLASPALSAKDLDSPALS, ALSKIADSLAPIKDLSPASLT

the best alignment between the two substrings

ALSKLASPA
ALSKIAD

can be computed by taking into consideration only

ALSKLASP + A     ALSKLASP + A     ALSKLASPA + -
ALSKIA   + D     ALSKIAD  + -     ALSKIA    + D

and choosing the best among these 3 possibilities

The Needleman-Wunsch Matrix

[Figure: dynamic programming matrix with x1 … xM along the horizontal axis and y1 … yN along the vertical axis]

Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences

The Needleman-Wunsch Algorithm

x = AGTA      m = 1  (match)
y = ATA       s = -1 (mismatch)
              d = -1 (gap)

F(i,j)        i = 0    1    2    3    4
                       A    G    T    A
j = 0              0   -1   -2   -3   -4
    1    A        -1    1    0   -1   -2
    2    T        -2    0    0    1    0
    3    A        -3   -1   -1    0    2

Optimal Alignment:

F(4,3) = 2

AGTA
A-TA

The Needleman-Wunsch Algorithm

• Initialization.
    F(0, 0) = 0
    F(0, j) = - j d
    F(i, 0) = - i d

• Main Iteration. Filling in partial alignments
    For each i = 1……M
      For each j = 1……N
        F(i, j) = max of:
            F(i-1, j) - d               [case 1]
            F(i, j-1) - d               [case 2]
            F(i-1, j-1) + s(xi, yj)     [case 3]

        Ptr(i, j) = UP    if [case 1]
                    LEFT  if [case 2]
                    DIAG  if [case 3]

• Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment
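The three steps above can be sketched in Python as follows. This is a minimal illustration, assuming a positive linear gap penalty d and a scoring function supplied by the caller; ties are broken in favour of the diagonal move, which is a choice made here and not something the algorithm prescribes.

```python
def needleman_wunsch(x, y, score, d):
    """Global alignment with linear gap penalty d; returns (score, aligned_x, aligned_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # Initialization: leading gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"
    # Main iteration: max() keeps the first best candidate, so DIAG wins ties.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            candidates = [
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "DIAG"),
                (F[i - 1][j] - d, "UP"),
                (F[i][j - 1] - d, "LEFT"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Termination: trace back from (M, N) to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

def match_mismatch(a, b):
    """Scoring of the worked example: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

result = needleman_wunsch("AGTA", "ATA", match_mismatch, 1)
```

On the worked example (x = AGTA, y = ATA) this reproduces the score F(4,3) = 2 and the alignment AGTA / A-TA.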

Exercise:

Suppose you only want to know the score of a global alignment.

=> Write a program that, given two input sequences (in a single file in FASTA format), a gap cost and a similarity matrix, computes the score of the global alignment in O(N*M) time and O(M) space, where M and N are the lengths of the input sequences and M <= N
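A possible sketch of the scoring part of this exercise keeps only two rows of the matrix at a time. FASTA parsing is omitted, the function name is illustrative, and swapping the two sequences so that the shorter one indexes the stored row assumes a symmetric scoring function.

```python
def global_score(x, y, score, d):
    """Global alignment score with linear gap penalty d, using O(min(M, N)) space."""
    if len(x) < len(y):
        x, y = y, x                      # shorter sequence along the stored row
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = max(prev[j - 1] + score(x[i - 1], y[j - 1]),  # diagonal
                          prev[j] - d,                              # gap in y
                          curr[j - 1] - d)                          # gap in x
        prev = curr
    return prev[len(y)]

def match_mismatch(a, b):
    """+1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

s_xy = global_score("AGTA", "ATA", match_mismatch, 1)
```

Only the previous and the current row are ever stored, so the traceback is not available; that is the price of the reduced memory.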

The Overlap Detection variant

Changes:

• Initialization
    For all i, j:
    F(i, 0) = 0
    F(0, j) = 0

• Termination
    F_OPT = max( max_i F(i, N), max_j F(M, j) )

[Figure: dynamic programming matrix with x1 … xN along the horizontal axis and y1 … yM along the vertical axis]

The local alignment problem

Given two strings x = x1……xM, y = y1……yN

Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum

e.g. x = aaaacccccgggg
     y = cccgggaaccaacc

Why local alignment

• Genes are shuffled between genomes

• Portions of proteins (domains) are often conserved

The Smith-Waterman algorithm

Idea: Ignore badly aligning regions

Modifications to Needleman-Wunsch:

Initialization: F(0, j) = F(i, 0) = 0

Iteration: F(i, j) = max of:
    0
    F(i - 1, j) - d
    F(i, j - 1) - d
    F(i - 1, j - 1) + s(xi, yj)

The Smith-Waterman algorithm

Termination:

• If we want the best local alignment…

    F_OPT = max_{i,j} F(i, j)

• If we want all local alignments scoring > t

    For all i, j find F(i, j) > t, and trace back
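The modifications listed above translate into the following sketch, which returns only the single best local alignment. A positive linear gap penalty d is assumed, and the traceback recomputes which case produced each cell instead of storing pointers; both are choices made for brevity.

```python
def smith_waterman(x, y, score, d):
    """Best local alignment with linear gap penalty d; returns (score, sub_x, sub_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

def match_mismatch(a, b):
    """+1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

local = smith_waterman("TTACGTT", "GGACGGG", match_mismatch, 1)
```

The same function can be applied to the pair x = aaaacccccgggg, y = cccgggaaccaacc used to introduce the local alignment problem.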

Scoring the gaps more accurately

Current model:

A gap of length n incurs penalty n·d

However, gaps usually occur in bunches

Concave gap penalty function g(n):

for all n, g(n + 1) - g(n) ≤ g(n) - g(n - 1)

[Figure: gap penalty g(n) as a function of the gap length n]

General gap dynamic programming

Initialization: same

Iteration:

F(i, j) = max of:
    F(i-1, j-1) + s(xi, yj)
    max_{k=0…i-1} F(k, j) - g(i-k)
    max_{k=0…j-1} F(i, k) - g(j-k)

where g(n) is the penalty for a gap of length n

Termination: same

Running Time: O(N^2 M) (assume N > M)

Space: O(NM)

Exercise:

Write a program that, given two input sequences (in a single file in FASTA format) and a choice of a general gap function and scoring matrix, computes the alignment of the two sequences and returns one of the possible best alignments.

Remember that when the best score is obtained using

max_{k=0…i-1} F(k, j) - g(i-k)

max_{k=0…j-1} F(i, k) - g(j-k)

you have to store this information (the maximizing k) in the corresponding pointer (back-trace) matrix.
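A compact sketch of the alignment step of this exercise is given below. It assumes the gap function g(n) returns a positive penalty, initializes the borders with -g(i) and -g(j) (an assumption, since the slides only say "same"), and stores the maximizing k in the pointer matrix as suggested. FASTA parsing is left out.

```python
def align_general_gap(x, y, score, g):
    """Global alignment with an arbitrary gap penalty g(n); returns (score, ax, ay)."""
    M, N = len(x), len(y)
    F = [[float("-inf")] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -g(i), ("UP", 0)
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -g(j), ("LEFT", 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            best, move = F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), ("DIAG", None)
            for k in range(i):                      # gap of length i-k in y
                cand = F[k][j] - g(i - k)
                if cand > best:
                    best, move = cand, ("UP", k)
            for k in range(j):                      # gap of length j-k in x
                cand = F[i][k] - g(j - k)
                if cand > best:
                    best, move = cand, ("LEFT", k)
            F[i][j], ptr[i][j] = best, move
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        kind, k = ptr[i][j]
        if kind == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif kind == "UP":
            while i > k:                            # emit the whole gap at once
                ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            while j > k:
                ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

def match_mismatch(a, b):
    """+1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

linear = align_general_gap("AGTA", "ATA", match_mismatch, lambda n: n)
```

With the linear penalty g(n) = n the result coincides with the Needleman-Wunsch example; any other function of the gap length can be passed in the same way.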

Compromise: affine gaps

g(n) = d + (n - 1)e
(d: gap open, e: gap extend)

To compute the optimal alignment,

at position i, j, we need to “remember” the best score if the gap is open and the best score if the gap is not open

F(i, j): score of the alignment of x1…xi to y1…yj if xi aligns to yj

G(i, j): score if xi, or yj, aligns to a gap

[Figure: plot of g(n) versus n, with intercept d and slope e]

Needleman-Wunsch with affine gaps

Initialization:
    F(0, 0) = 0
    F(i, 0) = - d - (i - 1)e
    F(0, j) = - d - (j - 1)e
    R(0, j) = -∞, C(i, 0) = -∞

Iteration:
    F(i, j) = max of:
        F(i - 1, j - 1) + s(xi, yj)
        R(i, j)
        C(i, j)

    R(i, j) = max of:
        F(i - 1, j) - d
        R(i - 1, j) - e

    C(i, j) = max of:
        F(i, j - 1) - d
        C(i, j - 1) - e

Termination: same
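The three-matrix recurrence can be sketched as below, computing the score only. It assumes positive penalties d (open) and e (extend) and negative border values -d - (k - 1)e; setting R(i, 0) and C(0, j) equal to the border scores is a choice made here so that a leading gap can be extended at cost e.

```python
def affine_global_score(x, y, score, d, e):
    """Global alignment score with affine gap penalty d + (n - 1) e."""
    M, N = len(x), len(y)
    NEG = float("-inf")
    F = [[NEG] * (N + 1) for _ in range(M + 1)]
    R = [[NEG] * (N + 1) for _ in range(M + 1)]   # x_i aligned to a gap
    C = [[NEG] * (N + 1) for _ in range(M + 1)]   # y_j aligned to a gap
    F[0][0] = 0
    for i in range(1, M + 1):
        R[i][0] = F[i][0] = -d - (i - 1) * e
    for j in range(1, N + 1):
        C[0][j] = F[0][j] = -d - (j - 1) * e
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            R[i][j] = max(F[i - 1][j] - d, R[i - 1][j] - e)   # open or extend
            C[i][j] = max(F[i][j - 1] - d, C[i][j - 1] - e)
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          R[i][j], C[i][j])
    return F[M][N]

def match_mismatch(a, b):
    """+1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

affine = affine_global_score("AAAA", "AA", match_mismatch, 3, 1)
```

In the example, a single gap of length 2 (penalty 3 + 1) is preferred to two separate gaps (penalty 3 + 3). With d = e the function reduces to the linear-gap case.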

Bounded Dynamic Programming

Assume we know that x and y are very similar

Assumption: # gaps(x, y) < k(N) (say N > M)

Then, xi aligned to yj implies | i - j | < k(N)

We can align x and y more efficiently:

Time, Space: O(N k(N)) << O(N^2)

Bounded Dynamic Programming

Initialization:

F(i,0), F(0,j) undefined for i, j > k

Iteration:

For i = 1…M
  For j = max(1, i - k)…min(N, i + k)
    F(i, j) = max of:
        F(i - 1, j - 1) + s(xi, yj)
        F(i, j - 1) - d, if j > i - k(N)
        F(i - 1, j) - d, if j < i + k(N)

Termination: same

[Figure: dynamic programming matrix (x1 … xM by y1 … yN) with a band of width k(N) around the main diagonal]

Easy to extend to the affine gap case
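The banded iteration can be sketched as follows, computing the score only. The band is taken as |i - j| <= k with k passed as a plain integer, cells outside the band are treated as minus infinity, and rows are stored as dictionaries; these are implementation choices for the sketch.

```python
def banded_global_score(x, y, score, d, k):
    """Global alignment score restricted to the band |i - j| <= k (linear gap penalty d)."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("band too narrow: |M - N| must be <= k")
    NEG = float("-inf")
    prev = {j: -j * d for j in range(0, min(N, k) + 1)}     # row i = 0
    for i in range(1, M + 1):
        curr = {}
        for j in range(max(0, i - k), min(N, i + k) + 1):
            if j == 0:
                curr[j] = -i * d                            # border cell inside the band
                continue
            curr[j] = max(prev.get(j - 1, NEG) + score(x[i - 1], y[j - 1]),
                          curr.get(j - 1, NEG) - d,         # missing cells are out of band
                          prev.get(j, NEG) - d)
        prev = curr
    return prev[N]

def match_mismatch(a, b):
    """+1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

banded = banded_global_score("AGTA", "ATA", match_mismatch, 1, 1)
```

Each row holds at most 2k + 1 cells, so time is O(M k) and the working memory O(k). The result equals the unrestricted optimum only if an optimal path stays inside the band.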

Significant alignments

Significance of an alignment

Given an alignment with score S, is it significant?

How are the scores of random alignments distributed?

100,000 alignments of unrelated and shuffled sequences:

[Figure: histogram of alignment scores; x-axis: Score, y-axis: Occurrences]

Z-score

Z = (S - <S>) / σ

S = alignment score
<S> = average of the scores on a random set of alignments
σ = standard deviation of the scores on a random set of alignments

Significance of the alignment

Z < 3: not significant
3 < Z < 10: probably significant
Z > 10: significant

Exercise: Z-score

Write a program that takes as input a FASTA file with two sequences and a number N. Compute the score of the global alignment of the two sequences, and the Z-score with respect to N shuffled sequences (generated from the first sequence in the FASTA file) aligned against the original second sequence.

Z = (S - <S>) / σ

S = alignment score
<S> = average of the scores on a random set of alignments
σ = standard deviation of the scores on a random set of alignments
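The Z-score computation might be sketched as below. The global score uses a linear gap penalty and a caller-supplied scoring function, the population standard deviation is used, and the `seed` parameter is an addition made here for reproducibility; FASTA parsing is omitted.

```python
import random
import statistics

def global_score(x, y, score, d):
    """Global alignment score with linear gap penalty d (two-row dynamic programming)."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = max(prev[j - 1] + score(x[i - 1], y[j - 1]),
                          prev[j] - d, curr[j - 1] - d)
        prev = curr
    return prev[len(y)]

def z_score(x, y, score, d, n, seed=0):
    """Return (S, Z), with Z computed against n shuffled copies of x aligned to y."""
    rng = random.Random(seed)
    s = global_score(x, y, score, d)
    letters = list(x)
    random_scores = []
    for _ in range(n):
        rng.shuffle(letters)                      # same composition, random order
        random_scores.append(global_score("".join(letters), y, score, d))
    mean = statistics.mean(random_scores)
    sigma = statistics.pstdev(random_scores)
    if sigma == 0:
        raise ValueError("all shuffled scores are identical; Z is undefined")
    return s, (s - mean) / sigma

def match_mismatch(a, b):
    """+1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

s, z = z_score("ACGTACGTACGTACGT", "ACGTACGTACGTACGT", match_mismatch, 1, 50)
```

Shuffling preserves the composition of the first sequence, so the random scores reflect only the loss of residue order.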

Is the Z-score reliable?

The Z-score of this alignment is 7.5 over 54 residues. Sequence identity is as high as 25.9%, yet the sequences have completely different structures.

Citrate synthase (2cts) vs transthyretin (2paba)

E-value

Expected number of random alignments obtaining a score greater than or equal to a given score (s)

Relies on the Extreme Value Distribution

E = K m n e^(-λ s)

m, n: lengths of the sequences
K, λ: “scaling” constants

The number of high-scoring random alignments increases when the sequence lengths increase, and decreases exponentially when the score increases.

E-value

Alignment significance

The significance of the E-value depends on the size of the database considered. Considering Swiss-Prot,

E > 10^-1: not significant
10^-1 > E > 10^-3: probably not significant
10^-3 > E > 10^-8: probably significant
E < 10^-8: significant

P-value

Probability for random alignments to obtain a score greater than or equal to a given score (s)

Given the E-value (expected number of alignments), which statistics describe the probability of having a number a of random alignments with score ≥ S?

Poisson: P(a) = e^(-E) E^a / a!

What is the probability of finding at least one random alignment with score ≥ S?

P(a ≥ 1) = 1 - P(0) = 1 - exp(-E)
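The relation between the E-value and the P-value can be checked numerically with a few lines. The values of K and λ used in the example are placeholders chosen for illustration, not parameters of any real scoring system.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of random alignments with at least this score: K m n exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)

def poisson(a, e):
    """Probability of observing exactly a random alignments when e are expected."""
    return math.exp(-e) * e ** a / math.factorial(a)

def p_value(e):
    """Probability of at least one random alignment: 1 - P(0) = 1 - exp(-E)."""
    return -math.expm1(-e)

# Placeholder constants (assumptions for the example only).
E = e_value(score=50, m=300, n=1_000_000, K=0.1, lam=0.3)
P = p_value(E)
```

For small E the two quantities are nearly equal, since 1 - exp(-E) ≈ E.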

Similarity search in Data Bases

Given a sequence, search for similar sequences in huge data sets

In principle, the alignments between the query sequence and ALL the sequences in the data sets could be tried

Too many sequences!

Heuristic algorithms can be used. They are not guaranteed to find the optimal alignment

FASTA
BLAST

FASTA

The query sequence is chopped into words of k-tup characters. Usually k-tup = 2 for proteins, 6 for DNA

ADKLPTLPLRLDPTNMVFGHLRI

Words (indexed by position):
AD, DK, KL, LP, PT, TL, LP, PL, LR, …
1   2   3   4   5   6   7   8   9  …

The list of indexed words is compiled for each sequence in the data set (subject)

The search for correspondences between the words is very fast. The difference between the indexes of a match in the query and in the subject sequence determines its distance from the main diagonal
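The word indexing and the diagonal bookkeeping can be illustrated with the sketch below. It covers only this first stage of FASTA; function names are illustrative, positions are 1-based as in the word list above, and a diagonal is identified by the difference between query and subject positions.

```python
from collections import Counter, defaultdict

def index_words(seq, ktup):
    """Map each word of length ktup to the list of its 1-based start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - ktup + 1):
        index[seq[pos:pos + ktup]].append(pos + 1)
    return index

def diagonal_hits(query, subject, ktup):
    """Count word matches per diagonal (query position minus subject position)."""
    q_index = index_words(query, ktup)
    hits = Counter()
    for pos in range(len(subject) - ktup + 1):
        for qpos in q_index.get(subject[pos:pos + ktup], ()):
            hits[qpos - (pos + 1)] += 1
    return hits

query = "ADKLPTLPLRLDPTNMVFGHLRI"
words = index_words(query, 2)
self_hits = diagonal_hits(query, query, 2)
```

Here `words["LP"]` lists positions 4 and 7, and comparing the query with itself puts every word on diagonal 0, the main diagonal.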

FASTA

[Figure: dot plot of word matches; axes: Query, Subject]

Many matches along the same diagonal correspond to longer identical segments along the sequences

FASTA

[Figure: dot plot; axes: Query, Subject]

The longest matched diagonals are evaluated with a score matrix (PAM or BLOSUM)

FASTA

[Figure: dot plot; axes: Query, Subject]

The most similar regions on close diagonals are isolated

FASTA

[Figure: dot plot; axes: Query, Subject]

An exact Smith-Waterman alignment is computed on a narrow band around the diagonal with the highest similarity (a 32-residue band is usually adopted)

Sequence similarity with FASTA

BLAST

The sequence data set is indexed as follows: for each possible residue triplet, the occurrences and their positions along the sequences are stored. (FORMATDB)

AAA
AAC
AAD
ACA
...

BLAST

The query sequence is chopped into words W residues long (usually W = 3 for proteins)

LSHLPTLPLRLDPTNMVFGHLRI

LSH, SHL, HLP, LPT, PTL, TLP, …

For each word, all the similar words are generated, using the BLOSUM62 matrix and setting a similarity threshold (usually T = 11-13)

LSH 16
ISH 14
MSH 14
VSH 13
LAH 13
LTH 13
LNH 13
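Generating the list of similar words can be sketched as a brute-force enumeration. The example below deliberately uses a toy three-letter alphabet and a toy scoring function (2 for identical residues, 0 otherwise), both assumptions made to keep it self-contained; with the full 20-letter alphabet and BLOSUM62 scores the same function would produce lists like the one above.

```python
from itertools import product

def neighborhood(word, alphabet, score, threshold):
    """All words of the same length whose score against `word` is >= threshold."""
    result = []
    for cand in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= threshold:
            result.append(("".join(cand), total))
    # Highest score first, alphabetical order within the same score.
    result.sort(key=lambda item: (-item[1], item[0]))
    return result

def toy_score(a, b):
    """Toy scoring (not BLOSUM62): 2 for identical symbols, 0 otherwise."""
    return 2 if a == b else 0

similar = neighborhood("AB", "ABC", toy_score, 2)
```

The enumeration visits every possible word, which is fine for W = 3 over 20 residues (8000 candidates).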

BLAST

Each word in the list of similar words is looked up in the sequences of the data set by means of the indexes

Each match is then extended for as long as the score stays higher than a threshold S


Sequence similarity with BLAST (Basic Local Alignment Search Tool)

Alignment of all the retrieved sequences

ATTENTION: It is not a multiple sequence alignment

Sequence profile

 1  Y K D Y H S - D K K K G E L - -
 2  Y R D Y Q T - D Q K K G D L - -
 3  Y R D Y Q S - D H K K G E L - -
 4  Y R D Y V S - D H K K G E L - -
 5  Y R D Y Q F - D Q K K G S L - -
 6  Y K D Y N T - H Q K K N E S - -
 7  Y R D Y Q T - D H K K A D L - -
 8  G Y G F G - - L I K N T E T T K
 9  T K G Y G F G L I K N T E T T K
10  T K G Y G F G L I K N T E T T K

Position
    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
A   0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
C   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D   0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
E   0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
F   0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
G  10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
H   0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
K   0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
I   0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
L   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0   0
M   0   0   0   0   0   0   0   0   0   0   0   0   0  60   0   0
N   0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q   0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
R   0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S   0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
T  20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
V   0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
W   0  10   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y  70   0   0  90   0   0   0   0   0   0   0   0   0   0   0   0
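A profile of this kind can be computed from the aligned sequences with a short function. The sketch below reports percentages per position and ignores gap characters when normalizing, which is an assumption based on columns 6 and 7 of the table (33% and 100%); the function name is illustrative.

```python
from collections import Counter

def sequence_profile(aligned):
    """Per-position residue percentages, computed over the non-gap characters."""
    profile = []
    for pos in range(len(aligned[0])):
        column = [s[pos] for s in aligned if s[pos] != "-"]
        counts = Counter(column)
        total = len(column)
        profile.append({res: 100.0 * c / total for res, c in counts.items()})
    return profile

ALIGNMENT = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]

profile = sequence_profile(ALIGNMENT)
```

Each element of `profile` is a dictionary for one alignment column, listing only the residues that actually occur there.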

Usefulness of the sequence profiles

A sequence profile describes the basic features of all the sequences used in the alignment

The most conserved regions and the most frequent mutations at each position are highlighted

Sequence-to-profile alignment

The alignment scores are weighted position by position using the profile. The same mutation at different positions is scored with different values

Sequence-to-profile alignment

Position i along a sequence profile is represented by a 20-valued vector Pi = Pi(A) Pi(C) …… Pi(Y)

Given the residue in position j along the sequence to align: Sj

The score for aligning Sj to the vector Pi is:

score(Pi, Sj) = sum_a Pi(a) M(a, Sj)

where M is a scoring matrix (BLOSUM or PAM)

• How to score a profile-profile alignment?
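One common way to score a residue against a profile position is a frequency-weighted average of matrix scores. The sketch below assumes that form, takes the profile column as a dictionary of frequencies summing to 1, and uses a toy match/mismatch function in place of BLOSUM or PAM.

```python
def profile_score(profile_column, residue, matrix_score):
    """Frequency-weighted score of `residue` against one profile position."""
    return sum(freq * matrix_score(a, residue)
               for a, freq in profile_column.items())

def toy_matrix(a, b):
    """Toy stand-in for a substitution matrix: +1 for a match, -1 otherwise."""
    return 1 if a == b else -1

# First position of the profile shown earlier, as frequencies.
column = {"Y": 0.7, "T": 0.2, "G": 0.1}
score_y = profile_score(column, "Y", toy_matrix)
score_w = profile_score(column, "W", toy_matrix)
```

The residue Y, frequent at this position, scores higher than W, which never occurs there; the same substitution would score differently at another position of the profile.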

PSI-BLAST

http://www.ncbi.nlm.nih.gov/BLAST/

[Diagram: Sequence + Data Base -> BLAST -> Sequence profile -> PSI-BLAST, repeated until convergence]

The design of PSI-BLAST

1. PSI-BLAST takes as input a single protein sequence and compares it to a protein database, using the gapped BLAST program.

2. The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions.

3. The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly.

4. PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.

5. Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.