sequence alignment (part 2)

47
Sequence alignment (part 2) Local vs. global alignment Linear gap penalties Statistics of alignments Lecture 3 6.095 / 6.895 – Computational Biology: Genomes, Networks, Evolution Thursday Sept 15, 2005

Upload: others

Post on 27-Dec-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Sequence alignment (part 2)

Local vs. global alignmentLinear gap penalties

Statistics of alignments

Lecture 3

6.095 / 6.895 – Computational Biology: Genomes, Networks, Evolution

Thursday Sept 15, 2005

Sequence alignment + dynamic programming

A C G T C A T C A

A C G T G A T C Amutation

A G T G T C A

A G T G T C A

deletion

A G T G T C AT

begin

end

A G T G T C ATinsertion

• The sequence alignment problem– Genomes change: mutation, insertions, deletions– Alignment: infer evolutionary events– Scoring metric reflects evolutionary properties

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGAAGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

• Needleman-Wunsch algorithm– Local update rule: F(i,j) = max{up, left, diagonal}– Save choice pointers for traceback– Bottom-right corner gives optimal alignment score– Trace-back of pointers gives optimal path/alignment

A C G T C A T C ATAGTGTCA

AG

TC/G

TC

A

• Dynamic programming and sequence alignment– Alignment scores are additive: decomposable– Represent sub-problem scores in M(i,j) matrix– Duality between alignment and path through matrix

• Dynamic programming– Problems that can be decomposed into subparts– Identical sub-problems: reuse computation– Bottom-up approach: systematically fill table

Last time: Two extensions of the basic algorithm• Bounded-space computation

– Space: O(k*m)– Time: O(k*m), where k = radius explored– Heuristic

• Not guaranteed optimal answer• Works very well in practice

– Practical interest

• Linear-space computation– Save only one col / row / diag at a time– Computes optimal score easily– Recursive call modification allows traceback– Theoretical interest

• Effective running time slower• Optimal answer guaranteed

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Today: Variations on the theme

• Changing the DP update rules– Global vs. local alignments– Overlap detection and semi-global alignment

• Affine gap penalties– Augmenting the state-space– Linear, affine, piecewise linear, general gap penalty

• Statistical significance of alignment– Where does s(xi, yj) come from?– Are two aligned sequences actually related

Local Alignment

Intro to Local Alignments• Statement of the problem

– A local alignment of strings s and tis an alignment of a substring of swith a substring of t

• Why local alignments?– Small domains of a gene may be only conserved portions– Looking for a small gene in a large chromosome (search)– Large segments often undergo rearrangements

t

s

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Global alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Local alignment B D A C

B

D

A

C

A B C D

A B C D

The local alignment problem

Given two strings x = x1……xM, y = y1……yN

Find substrings x’, y’ whose similarity (optimal global alignment value)is maximum

e.g. x = aaaacccccggggy = ttgcctgggaaccaacc

How can we use dynamic programming for local alignment ?

Global Alignment

Initialization:• Top left: 0Update Rule:A(i,j)=max{

• A(i-1 , j ) - 2• A( i , j-1) - 2• A(i-1 , j-1) ±1

}Termination:• Bottom right

- A G T

A

A

G

C

- 0 -2 -4 -6

-2 1 -1 -5

-4 1 0 -2

-6 -1 2 0

-8 -1 0 1

1

1

1

-1

-1 -1

-1 -1

-1

-1

Local Alignment

Initialization:• Top left: 0Update Rule:A(i,j)=max{

• A(i-1 , j ) - 2• A( i , j-1) - 2• A(i-1 , j-1) ±1• 0

}Termination:• Anywhere

- A G T

A

A

G

C

- 0 0 0 0

0 1 0 0

0 1 0 0

0 0 2 0

0 0 0 1

1

1

1

-1

-1

Local Alignment issues• Resolving ambiguities

– When following arrows back, one can stop at any of the zero entries. Only stop when no arrow leaves. Longest.

• Correctness sketch by induction– Assume we’ve correctly aligned up to (i,j)– Consider the four cases of our max computation– By inductive hypothesis recurse on (i-1,j-1), (i-1,j), (i,j-1)– Base case: empty strings are suffixes aligned optimally

• Time analysis– O(mn) time– O(mn) space, can be brought to O(m+n)

Best local alignment vs. all local alignments

Termination:

1. If we want the best local alignment…

FOPT = maxi,j F(i, j)

2. If we want all local alignments scoring > t

For all i, j find F(i, j) > t, and trace back

Global Alignment vs. Local alignment

Initialization: F(0, 0) = 0

Iteration:

F(i – 1, j) – dF(i, j) = max F(i, j – 1) – d

F(i – 1, j – 1) + s(xi, yj)

Termination: Bottom right

Initialization: F(0, j) = F(i, 0) = 0

Iteration:0

F(i, j) = max F(i – 1, j) – dF(i, j – 1) – dF(i – 1, j – 1) + s(xi, yj)

Termination: Anywhere

Needleman-Wunsch algorithm Smith-Waterman algorithm

Semi-global alignment and

overlap detection

Different types of overlaps

• Application: genome assembly– Search for consecutive sequence segments– Boundaries of compared sequences not fully known

A variant of the basic algorithm:

• Allow unlimited # of gaps in the beginning and end• No penalty for end gaps

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGCGCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

The Overlap Detection variant

Changes:

1. InitializationFor all i, j,

F(i, 0) = 0F(0, j) = 0

2. Terminationmaxi F(i, N)

FOPT = max maxj F(M, j)

x1 ……………………………… xM

y 1…

……

……

……

……

……

…y N

Affine gap penalties

Scoring the gaps more accurately

Current model: Linear gap penalty function

Gap of length: 1 incurs penalty: gGap of length: k incurs penalty: k*g

However, in evolution, larger segments are gained and lostgaps usually occur in bunches

More accurate model: Convex gap penalty function:

γ(n):for all n, γ(n + 1) - γ(n) ≤ γ(n) - γ(n – 1)

γ(n)

γ(n)

General gap dynamic programming

Initialization: same

Iteration:F(i-1, j-1) + s(xi, yj)

F(i, j) = max maxk=0…i-1F(k,j) – γ(i-k) maxk=0…j-1F(i,k) – γ(j-k)

Termination: same

F(i,j)

Running Time: O(N2M) (cubic)Space: O(NM)

Can we do better?

Compromise: affine gaps

γ(n) = p + (n – 1)×q| |

gap gapopen extendpenalty penalty

• Gap open– To introduce the first gap, cost of introducing a DNA break– Fixed cost for opening a gap: p+q

• Gap extend– Larger segments more costly than smaller ones– Linear cost increment for increasing number of gaps: q

• How can we compute this using dynamic programming?– Achieve O(m*n) running time & space

pq

γ(n)

Additional Matrices

• The amount of state needed increases– In scoring a single entry in our matrix, we need

remember an extra piece of information• Are we continuing a gap in s? (if not, start is more expensive)• Are we continuing a gap in t? (if not, start is more expensive)• Are we continuing from a match between s(i) and t(j)?

• Dynamic programming framework– We encode this information in three different matrices– For each element (i,j) we use three variables

• a(i,j): best alignment of s[1..i] & t[1..j] that aligns s[i] with t[j]• b(i,j): best alignment of s[1..i] & t[1..j] that aligns gap with t[j]• c(i,j): best alignment of s[1..i] & t[1..j] that aligns s[i] with gap

Update rules

⎟⎟⎟

⎜⎜⎜

−−−−−−

+=)1,1()1,1()1,1(

max])[],[(),(jicjibjia

jtisscorejia

⎟⎟⎟

⎜⎜⎜

+−−−−

+−−=

)()1,()1,(

)()1,(max),(

qpjicqjib

qpjiajib

⎟⎟⎟

⎜⎜⎜

+−−−−

+−−=

)(),1(),1(

)(),1(max),(

qpjibqjic

qpjiajic

When s[j] and t[j] are aligned

When t[j] aligns with a gap in s

When s[i] aligns with a gap in t

starting a gap in s

extending a gap in s

Stopping a gap in t,and starting one in s

Find maximum over all three arrays max(a[m,n],b[m,n],c[m,n]). Follow arrows back, skipping from matrix to matrix

Simplified update rules

⎟⎟⎟

⎜⎜⎜

−−−−−−

+=)1,1()1,1()1,1(

max])[],[(),(jicjibjia

jtisscorejia

⎟⎟⎠

⎞⎜⎜⎝

⎛−−

+−−=

qjibqpjia

jib)1,(

)()1,(max),(

⎟⎟⎠

⎞⎜⎜⎝

⎛−−

+−−=

qjicqpjia

jic),1(

)(),1(max),(

When s[j] and t[j] are aligned

When t[j] aligns with a gap in s

When s[i] aligns with a gap in t

starting a gap in s

extending a gap in s

• Transitions from b to c are not necessary...…if the worst mismatch costs less than p+qACC-GGTAA--TGGTA

ACCGGTAA-TGGTA

Using a single gap matrix

Initialization: F(i, 0) = p + (i – 1)×qF(0, j) = p + (j – 1)×q

Iteration:F(i – 1, j – 1) + s(xi, yj)

F(i, j) = maxG(i – 1, j – 1) + s(xi, yj)

F(i – 1, j) – p F(i, j – 1) – p

G(i, j) = max G(i, j – 1) – qG(i – 1, j) – q

Termination: same

To generalize a little…

… think of how you would compute optimal alignment with this gap function

….in time O(m*n)

γ(n)

General Gap Penalty• Gap penalties are limited by the amount of state

– Linear gap penalty: w(k) = k*p• State: Current index tells if in a gap or not

– Affine gap penalty: w(k) = p + q*k, where q<p• State: add binary value for each sequence: starting a gap or not

– Quadratic: w(k) = p+q*k+rk2.• State: needs to encode the length of the gap, which can be O(n)• To encode it we need O(log n) bits of information. Not feasible

– Length (mod 3) gap penalty for protein-coding regions• Gaps of length divisible by 3 are penalized less: conserve frame• This is feasible, but requires more possible states• Possible states are: starting, mod 3=1, mod 3=2, mod 3=0

What have we learned so far?

• Sequence alignment and dynamic programming

• Different gap penalty functions

Global Local Semi-global

γ(n)

d eγ(n)

LinearLast lecture

Affine2nd matrix

γ(n)

GeneralO(m*n2)

γ(n)

Len (mod 3)Great Project

The statistics of alignments

Where does s(xi, yj) come from?Are two aligned sequences actually related?

Courtesy of National Center for Biotechnology Information.

Rare amino acids have high weights

Common amino acids have low weights

Positive for chemically similar substitution BLOSUM 62

Summary

• Dynamic Programming variations– Local alignment vs. global alignment– Semi-global alignment– Linear gap penalties

• Sequence alignment statistics– Computing the substitution scores– Protein alignment– BLOSUM62 matrix– Extreme value distribution

• Next lectures: – String matching in linear time!– Database search and hashing

What’s next

• Find the best match to a single gene in a genome

• Understand the correspondence of all genes

• Align two genes

• Evolutionary relationships of many genes

• Align many genes simultaneously

• Find the best match to a single gene in a genome– (Lecture 5: Database search)

• Understand the correspondence of all genes– (Lecture 24: Genome-scale comparative genomics)

• Align two genes– (Lectures 2 + 3: Sequence alignment)

• Evolutionary relationships of many genes– (Lecture 12: Phylogeny)

• Align many genes simultaneously– (Lecture 13: Profile alignment)

Project ideas

• Coding-centric alignment

• Alignment with inversions

• Consensus-word multiple alignment

• Alignment with multiple seeds

• Multiple progressive alignment

• Statistics of genome alignment