Sequence alignment (part 2)
Local vs. global alignment
Linear gap penalties
Statistics of alignments
Lecture 3
6.095 / 6.895 – Computational Biology: Genomes, Networks, Evolution
Thursday Sept 15, 2005
Sequence alignment + dynamic programming
[Figure: a sequence changing between "begin" and "end". Mutation: A C G T C A T C A becomes A C G T G A T C A. Deletion: the sequence becomes A G T G T C A. Insertion: a T is inserted into A G T G T C A.]
• The sequence alignment problem
  – Genomes change: mutation, insertions, deletions
  – Alignment: infer evolutionary events
  – Scoring metric reflects evolutionary properties
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
• Needleman-Wunsch algorithm
  – Local update rule: F(i,j) = max{up, left, diagonal}
  – Save choice pointers for traceback
  – Bottom-right corner gives optimal alignment score
  – Trace-back of pointers gives optimal path/alignment
[Figure: dynamic programming matrix for A C G T C A T C A against T A G T G T C A, with the traceback path marked]
• Dynamic programming and sequence alignment
  – Alignment scores are additive: decomposable
  – Represent sub-problem scores in M(i,j) matrix
  – Duality between alignment and path through matrix
• Dynamic programming
  – Problems that can be decomposed into subparts
  – Identical sub-problems: reuse computation
  – Bottom-up approach: systematically fill table
Last time: Two extensions of the basic algorithm
• Bounded-space computation
  – Space: O(k*m)
  – Time: O(k*m), where k = radius explored
  – Heuristic
    • Not guaranteed optimal answer
    • Works very well in practice
  – Practical interest
• Linear-space computation
  – Save only one col / row / diag at a time
  – Computes optimal score easily
  – Recursive call modification allows traceback
  – Theoretical interest
    • Effective running time slower
    • Optimal answer guaranteed
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
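The "save only one row at a time" idea can be sketched as follows. This is a minimal sketch, not the lecture's code: the function name `nw_score_two_rows` and the default scores (match +1, mismatch -1, gap penalty d = 2) are assumptions chosen for illustration. It returns only the optimal global score; recovering the alignment itself needs the recursive modification mentioned above.

```python
def nw_score_two_rows(x, y, match=1, mismatch=-1, d=2):
    """Optimal global alignment score using O(len(y)) memory (assumed scores)."""
    # prev holds row i-1 of the DP table; cur holds row i.
    prev = [-d * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-d * i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = max(prev[j] - d,       # x[i] aligned with a gap
                         cur[j - 1] - d,    # y[j] aligned with a gap
                         prev[j - 1] + s)   # x[i] aligned with y[j]
        prev = cur
    return prev[-1]


score = nw_score_two_rows(
    "AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA",
    "AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC",
)
print(score)
```

Only two rows are alive at any time, so memory is proportional to the shorter dimension if the shorter string is passed as `y`.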
Today: Variations on the theme
• Changing the DP update rules
  – Global vs. local alignments
  – Overlap detection and semi-global alignment
• Affine gap penalties
  – Augmenting the state-space
  – Linear, affine, piecewise linear, general gap penalty
• Statistical significance of alignment
  – Where does s(xi, yj) come from?
  – Are two aligned sequences actually related?
Intro to Local Alignments
• Statement of the problem
  – A local alignment of strings s and t is an alignment of a substring of s with a substring of t
• Why local alignments?
  – Small domains of a gene may be the only conserved portions
  – Looking for a small gene in a large chromosome (search)
  – Large segments often undergo rearrangements
[Figure: strings s and t with a matching pair of substrings; the two example sequences shown under "Global alignment" and under "Local alignment"; rearranged segments A B C D versus B D A C]
The local alignment problem
Given two strings x = x1……xM, y = y1……yN
Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum
e.g. x = aaaacccccgggg
     y = ttgcctgggaaccaacc
How can we use dynamic programming for local alignment?
Global Alignment
Initialization:
• Top left: 0
Update Rule:
A(i,j) = max {
• A(i-1, j) - 2
• A(i, j-1) - 2
• A(i-1, j-1) ± 1   (+1 for a match, -1 for a mismatch)
}
Termination:
• Bottom right

        -    A    G    T
   -    0   -2   -4   -6
   A   -2    1   -1   -3
   A   -4   -1    0   -2
   G   -6   -3    0   -1
   C   -8   -5   -2   -1
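The global update rule and the traceback can be sketched in a few lines. This is an illustrative sketch under the slide's scoring (match +1, mismatch -1, gap -2); the function name `needleman_wunsch` and its parameters are assumptions, and the traceback re-derives each choice from the table instead of storing pointers.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment: returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -d * i              # x[1..i] aligned against gaps only
    for j in range(1, n + 1):
        F[0][j] = -d * j              # y[1..j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j] - d, F[i][j - 1] - d, F[i - 1][j - 1] + s)
    # Traceback from the bottom-right corner to the top-left corner.
    ax, ay, i, j = [], [], m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AAGC", "AGT"))
```

The call uses the example strings above, with AAGC indexing the rows and AGT the columns. When several choices tie, this sketch prefers the diagonal move; other tie-breaking orders give other alignments with the same score.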
Local Alignment
Initialization:
• Top left: 0
Update Rule:
A(i,j) = max {
• A(i-1, j) - 2
• A(i, j-1) - 2
• A(i-1, j-1) ± 1   (+1 for a match, -1 for a mismatch)
• 0
}
Termination:
• Anywhere

        -    A    G    T
   -    0    0    0    0
   A    0    1    0    0
   A    0    1    0    0
   G    0    0    2    0
   C    0    0    0    1
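The local variant changes only three things: the extra 0 in the max, the zero boundary, and the start of the traceback. A minimal sketch follows, assuming the same scores as the example (match +1, mismatch -1, gap -2); the name `smith_waterman` is used here for illustration, and the traceback stops at the first zero entry it reaches.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=2):
    """Local alignment: returns (best score, aligned part of x, aligned part of y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j] - d, F[i][j - 1] - d, F[i - 1][j - 1] + s)
            if F[i][j] > best:                  # termination can be anywhere
                best, bi, bj = F[i][j], i, j
    # Trace back from the best cell until a zero entry is reached.
    ax, ay, i, j = [], [], bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("AAGC", "AGT"))   # (2, 'AG', 'AG')
```

On the example strings the best cell is the 2 in the G/G position, and the traceback recovers AG aligned with AG.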
Local Alignment issues
• Resolving ambiguities
  – When following arrows back, one can stop at any of the zero entries. To get the longest alignment, stop only at an entry that no arrow leaves.
• Correctness sketch by induction
  – Assume we’ve correctly aligned up to (i,j)
  – Consider the four cases of our max computation
  – By inductive hypothesis recurse on (i-1,j-1), (i-1,j), (i,j-1)
  – Base case: empty strings are suffixes aligned optimally
• Time analysis
  – O(mn) time
  – O(mn) space, can be brought to O(m+n)
Best local alignment vs. all local alignments
Termination:
1. If we want the best local alignment…
   $F_{\mathrm{OPT}} = \max_{i,j} F(i, j)$
2. If we want all local alignments scoring > t
   For every i, j with F(i, j) > t, trace back from (i, j)
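Both termination rules read off the same filled table. The sketch below is a hypothetical helper (`local_scores_above` is not a name from the lecture) that returns the best local score together with every end cell whose score exceeds a threshold t; the scores match +1, mismatch -1 and gap -2 are assumed. Tracing back from each reported cell is left out, and nearby cells will usually trace back to overlapping alignments.

```python
def local_scores_above(x, y, t, match=1, mismatch=-1, d=2):
    """Return (F_opt, hits), where hits lists every (i, j, F(i,j)) with F(i,j) > t."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    hits = []
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j] - d, F[i][j - 1] - d, F[i - 1][j - 1] + s)
            if F[i][j] > t:
                hits.append((i, j, F[i][j]))   # candidate end point of a local alignment
    f_opt = max(max(row) for row in F)         # rule 1: best local alignment
    return f_opt, hits


f_opt, hits = local_scores_above("AAGC", "AGT", t=0)
print(f_opt, hits)
```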
Global Alignment vs. Local alignment

Needleman-Wunsch algorithm (global)
Initialization: F(0, 0) = 0
Iteration:
\[ F(i, j) = \max \begin{cases} F(i-1, j) - d \\ F(i, j-1) - d \\ F(i-1, j-1) + s(x_i, y_j) \end{cases} \]
Termination: Bottom right

Smith-Waterman algorithm (local)
Initialization: F(0, j) = F(i, 0) = 0
Iteration:
\[ F(i, j) = \max \begin{cases} 0 \\ F(i-1, j) - d \\ F(i, j-1) - d \\ F(i-1, j-1) + s(x_i, y_j) \end{cases} \]
Termination: Anywhere
Different types of overlaps
• Application: genome assembly
  – Search for consecutive sequence segments
  – Boundaries of compared sequences not fully known

A variant of the basic algorithm:
• Allow unlimited # of gaps in the beginning and end
• No penalty for end gaps

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
The Overlap Detection variant
Changes:
1. Initialization
   For all i, j: F(i, 0) = 0 and F(0, j) = 0
2. Termination
   \[ F_{\mathrm{OPT}} = \max \begin{cases} \max_i F(i, N) \\ \max_j F(M, j) \end{cases} \]
[Figure: DP matrix with x1 … xM along one axis and y1 … yN along the other]
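The two changes (free start on the first row or column, free end on the last row or column) can be sketched as below. The function name `overlap_score` and the scores (match +1, mismatch -1, gap penalty d = 2) are assumptions for illustration; only the score is returned, not the alignment.

```python
def overlap_score(x, y, match=1, mismatch=-1, d=2):
    """Best alignment score when gaps at either end of x or y are free."""
    m, n = len(x), len(y)
    # Change 1: the whole first row and first column are 0, not -d*i or -d*j.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j] - d, F[i][j - 1] - d, F[i - 1][j - 1] + s)
    # Change 2: the alignment may end anywhere on the last column or last row.
    last_col = max(F[i][n] for i in range(m + 1))
    last_row = max(F[m][j] for j in range(n + 1))
    return max(last_col, last_row)


# The suffix TTT of the first read overlaps the prefix TTT of the second.
print(overlap_score("ACGTTT", "TTTGCA"))   # 3
```

Unlike local alignment there is no 0 inside the max, so interior cells can go negative; only the boundary is free.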
Scoring the gaps more accurately
Current model: Linear gap penalty function
• Gap of length 1 incurs penalty g
• Gap of length k incurs penalty k*g

However, in evolution, larger segments are gained and lost: gaps usually occur in bunches

More accurate model: Convex gap penalty function γ(n):
for all n, γ(n + 1) - γ(n) ≤ γ(n) - γ(n – 1)
[Figure: plots of γ(n)]
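The condition on γ says that each additional gap position costs no more than the previous one did. It can be checked numerically for a candidate penalty function. This is a small sketch with a hypothetical helper `satisfies_gap_condition` and example penalty functions that are assumptions, not from the lecture; the check starts at n = 2 so that γ(0) never needs to be defined.

```python
import math


def satisfies_gap_condition(gamma, n_max, tol=1e-12):
    """True if gamma(n+1) - gamma(n) <= gamma(n) - gamma(n-1) for n = 2..n_max-1."""
    return all(
        gamma(n + 1) - gamma(n) <= gamma(n) - gamma(n - 1) + tol
        for n in range(2, n_max)
    )


linear = lambda n: 2 * n                  # k*g with g = 2 (assumed value)
logarithmic = lambda n: 1 + math.log(n)   # increments shrink as n grows
quadratic = lambda n: n * n               # increments grow, so the condition fails

print(satisfies_gap_condition(linear, 50))       # True
print(satisfies_gap_condition(logarithmic, 50))  # True
print(satisfies_gap_condition(quadratic, 50))    # False
```

The linear penalty meets the condition with equality, which is why it cannot favour one long gap over several short ones.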
General gap dynamic programming
Initialization: same
Iteration:
\[ F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ \max_{k=0 \ldots i-1} F(k, j) - \gamma(i-k) \\ \max_{k=0 \ldots j-1} F(i, k) - \gamma(j-k) \end{cases} \]
Termination: same

Running Time: O(N^2 M) (cubic)
Space: O(NM)
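The extra inner loops over k are what make this cubic. A minimal sketch of the recurrence follows, assuming match +1 and mismatch -1; the name `general_gap_score` and the way the boundary is filled (by the same recurrence, starting from F(0,0) = 0) are choices made here for illustration. The penalty function `gamma` is passed in by the caller.

```python
def general_gap_score(x, y, gamma, match=1, mismatch=-1):
    """Global alignment score for an arbitrary gap penalty gamma(length)."""
    m, n = len(x), len(y)
    NEG = float("-inf")
    F = [[NEG] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = F[i - 1][j - 1] + s
            for k in range(i):      # gap of length i-k ending at x[i]
                best = max(best, F[k][j] - gamma(i - k))
            for k in range(j):      # gap of length j-k ending at y[j]
                best = max(best, F[i][k] - gamma(j - k))
            F[i][j] = best
    return F[m][n]


# With a linear penalty this reduces to the basic algorithm.
print(general_gap_score("AAGC", "AGT", lambda k: 2 * k))
# A penalty that makes one gap of length 2 cheaper than two gaps of length 1.
print(general_gap_score("ACGT", "AT", lambda k: 2 + k))   # -2
```

In the second call the best alignment pairs A with A and T with T and pays γ(2) = 4 once for the gap opposite CG.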
Can we do better?
Compromise: affine gaps
γ(n) = p + n×q, where p is the gap open penalty and q is the gap extend penalty

• Gap open
  – To introduce the first gap, cost of introducing a DNA break
  – Fixed cost for opening a gap: p+q
• Gap extend
  – Larger segments more costly than smaller ones
  – Linear cost increment for each additional gap position: q
• How can we compute this using dynamic programming?
  – Achieve O(m*n) running time & space

[Figure: plot of the affine penalty γ(n), labelled with p and q]
Additional Matrices
• The amount of state needed increases
  – In scoring a single entry in our matrix, we need to remember an extra piece of information
    • Are we continuing a gap in s? (if not, start is more expensive)
    • Are we continuing a gap in t? (if not, start is more expensive)
    • Are we continuing from a match between s(i) and t(j)?
• Dynamic programming framework
  – We encode this information in three different matrices
  – For each element (i,j) we use three variables
    • a(i,j): best alignment of s[1..i] & t[1..j] that aligns s[i] with t[j]
    • b(i,j): best alignment of s[1..i] & t[1..j] that aligns gap with t[j]
    • c(i,j): best alignment of s[1..i] & t[1..j] that aligns s[i] with gap
Update rules
When s[i] and t[j] are aligned:
\[ a(i, j) = \mathrm{score}(s[i], t[j]) + \max \begin{cases} a(i-1, j-1) \\ b(i-1, j-1) \\ c(i-1, j-1) \end{cases} \]

When t[j] aligns with a gap in s:
\[ b(i, j) = \max \begin{cases} a(i, j-1) - (p+q) & \text{starting a gap in } s \\ b(i, j-1) - q & \text{extending a gap in } s \\ c(i, j-1) - (p+q) & \text{stopping a gap in } t \text{ and starting one in } s \end{cases} \]

When s[i] aligns with a gap in t:
\[ c(i, j) = \max \begin{cases} a(i-1, j) - (p+q) \\ c(i-1, j) - q \\ b(i-1, j) - (p+q) \end{cases} \]

Find maximum over all three arrays: max(a[m,n], b[m,n], c[m,n]). Follow arrows back, skipping from matrix to matrix.
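The three update rules translate directly into three tables filled in lockstep. This is a sketch under stated assumptions, not the lecture's implementation: the name `affine_three_matrices`, the default values p = 2, q = 1, the match/mismatch scores +1/-1, and the boundary values (a leading gap of length k costs p + k×q, everything else unreachable) are choices made here. Only the optimal score is returned.

```python
def affine_three_matrices(s, t, p=2, q=1, match=1, mismatch=-1):
    """Global alignment score where a gap of length k costs p + k*q."""
    NEG = float("-inf")
    m, n = len(s), len(t)
    a = [[NEG] * (n + 1) for _ in range(m + 1)]   # s[i] aligned with t[j]
    b = [[NEG] * (n + 1) for _ in range(m + 1)]   # gap aligned with t[j]
    c = [[NEG] * (n + 1) for _ in range(m + 1)]   # s[i] aligned with gap
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(p + q * j)                    # leading gap in s
    for i in range(1, m + 1):
        c[i][0] = -(p + q * i)                    # leading gap in t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sc = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = sc + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            b[i][j] = max(a[i][j - 1] - (p + q),  # start a gap in s
                          b[i][j - 1] - q,        # extend a gap in s
                          c[i][j - 1] - (p + q))  # stop a gap in t, start one in s
            c[i][j] = max(a[i - 1][j] - (p + q),  # start a gap in t
                          c[i - 1][j] - q,        # extend a gap in t
                          b[i - 1][j] - (p + q))  # stop a gap in s, start one in t
    return max(a[m][n], b[m][n], c[m][n])


print(affine_three_matrices("ACGT", "AT"))   # -2
```

In the example the best alignment pairs A with A and T with T and opens one gap of length 2 opposite CG, costing p + 2q = 4. Setting p = 0 makes the penalty linear, so the result then agrees with the basic algorithm with d = q.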
Simplified update rules
When s[i] and t[j] are aligned:
\[ a(i, j) = \mathrm{score}(s[i], t[j]) + \max \begin{cases} a(i-1, j-1) \\ b(i-1, j-1) \\ c(i-1, j-1) \end{cases} \]

When t[j] aligns with a gap in s:
\[ b(i, j) = \max \begin{cases} a(i, j-1) - (p+q) & \text{starting a gap in } s \\ b(i, j-1) - q & \text{extending a gap in } s \end{cases} \]

When s[i] aligns with a gap in t:
\[ c(i, j) = \max \begin{cases} a(i-1, j) - (p+q) \\ c(i-1, j) - q \end{cases} \]

• Transitions from b to c are not necessary...
  …if the worst mismatch costs less than p+q

  ACC-GGTA        ACCGGTA
  A--TGGTA   vs.  A-TGGTA
Using a single gap matrix
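Here the first position of a gap costs p and each further position costs q.

Initialization:
  F(i, 0) = –(p + (i – 1)×q) for i ≥ 1
  F(0, j) = –(p + (j – 1)×q) for j ≥ 1

Iteration:
\[ F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ G(i-1, j-1) + s(x_i, y_j) \end{cases} \]
\[ G(i, j) = \max \begin{cases} F(i-1, j) - p \\ F(i, j-1) - p \\ G(i, j-1) - q \\ G(i-1, j) - q \end{cases} \]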
Termination: same
To generalize a little… think of how you would compute the optimal alignment with this gap function in time O(m*n)
[Figure: plot of the gap function γ(n)]
General Gap Penalty
• Gap penalties are limited by the amount of state
  – Linear gap penalty: w(k) = k*p
    • State: Current index tells if in a gap or not
  – Affine gap penalty: w(k) = p + q*k, where q < p
    • State: add binary value for each sequence: starting a gap or not
  – Quadratic: w(k) = p + q*k + r*k^2
    • State: needs to encode the length of the gap, which can be O(n)
    • To encode it we need O(log n) bits of information. Not feasible
  – Length (mod 3) gap penalty for protein-coding regions
    • Gaps of length divisible by 3 are penalized less: conserve frame
    • This is feasible, but requires more possible states
    • Possible states are: starting, mod 3=1, mod 3=2, mod 3=0
What have we learned so far?
• Sequence alignment and dynamic programming
  – Global
  – Local
  – Semi-global
• Different gap penalty functions
  – Linear: last lecture
  – Affine: 2nd matrix
  – General: O(m*n^2)
  – Len (mod 3): great project
[Figure: plots of γ(n) for each gap penalty function]
The statistics of alignments
Where does s(xi, yj) come from?
Are two aligned sequences actually related?

[Figure: BLOSUM 62 substitution matrix. Courtesy of National Center for Biotechnology Information.]
• Rare amino acids have high weights
• Common amino acids have low weights
• Positive for chemically similar substitution
Summary
• Dynamic Programming variations
  – Local alignment vs. global alignment
  – Semi-global alignment
  – Linear gap penalties
• Sequence alignment statistics
  – Computing the substitution scores
  – Protein alignment
  – BLOSUM62 matrix
  – Extreme value distribution
• Next lectures:
  – String matching in linear time!
  – Database search and hashing
What’s next
• Find the best match to a single gene in a genome
  – (Lecture 5: Database search)
• Understand the correspondence of all genes
  – (Lecture 24: Genome-scale comparative genomics)
• Align two genes
  – (Lectures 2 + 3: Sequence alignment)
• Evolutionary relationships of many genes
  – (Lecture 12: Phylogeny)
• Align many genes simultaneously
  – (Lecture 13: Profile alignment)