sequence alignment. complete dna sequences more than 1000 complete genomes have been sequenced
Post on 19-Dec-2015
Sequence Alignment
Complete DNA Sequences
More than 1000 complete genomes have been sequenced
Comparative Genomics
Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
Mutation
SEQUENCE EDITS
REARRANGEMENTS
Deletion
Inversion, Translocation, Duplication
Evolutionary Rates
[Figure: mutations passed on to the next generation, labeled OK, X, or "Still OK?"]
Sequence conservation implies function
Alignment is the key to
• Finding important regions
• Determining function
• Uncovering evolutionary events
Sequence Alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
Given two strings x = x1x2…xM, y = y1y2…yN,
an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
What is a good alignment?
AGGCTAGTT, AGCGAAGTTT
AGGCTAGTT- 6 matches, 3 mismatches, 1 gap
AGCGAAGTTT
AGGCTA-GTT- 7 matches, 1 mismatch, 3 gaps
AG-CGAAGTTT
AGGC-TA-GTT- 7 matches, 0 mismatches, 5 gaps
AG-CG-AAGTTT
Scoring Function
• Sequence edits:   AGGCCTC
   Mutations        AGGACTC
   Insertions       AGGGCCTC
   Deletions        AGG.CTC
Scoring Function:
   Match: +m
   Mismatch: -s
   Gap: -d
Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
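The scoring function can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` and the default values m = s = d = 1 are assumptions, not part of the slides. It counts matches, mismatches, and gaps column by column for an alignment given as two gapped rows.

```python
def score_alignment(row_x, row_y, m=1, s=1, d=1):
    """Score a gapped alignment given as two equal-length rows ('-' marks a gap)."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1            # a letter lined up with a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    score = matches * m - mismatches * s - gaps * d
    return score, (matches, mismatches, gaps)


# The three candidate alignments of AGGCTAGTT and AGCGAAGTTT from the slides
for rows in [("AGGCTAGTT-", "AGCGAAGTTT"),
             ("AGGCTA-GTT-", "AG-CGAAGTTT"),
             ("AGGC-TA-GTT-", "AG-CG-AAGTTT")]:
    print(rows, score_alignment(*rows))
```

With these assumed parameters the second alignment (7 matches, 1 mismatch, 3 gaps) scores highest; other values of m, s, d can change the ranking.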
Alternative definition:
minimal edit distance
“Given two strings x, y, find minimum # of edits (insertions, deletions, mutations) to transform one string to the other”
How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible alignments: >> 2^N (exercise)
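One way to get a feel for the size of the search space is to count alignments directly. The sketch below assumes that an alignment is counted as a path through the (M+1) × (N+1) grid using diagonal, horizontal, and vertical steps, so alignments that differ only in the order of adjacent gaps are counted separately. The function name `count_alignments` is hypothetical.

```python
def count_alignments(M, N):
    """Count monotone grid paths from (0, 0) to (M, N) with diagonal, right, and down steps."""
    prev = [1] * (N + 1)                 # row 0: only one way to reach each cell
    for _ in range(M):
        cur = [1]                        # column 0: only one way
        for j in range(1, N + 1):
            # arrive from above, from the diagonal, or from the left
            cur.append(prev[j] + prev[j - 1] + cur[j - 1])
        prev = cur
    return prev[N]


for n in (2, 5, 10):
    print(n, count_alignments(n, n), 2 ** n)
```

Under this counting convention the number of alignments already exceeds 2^N by several orders of magnitude for N = 10.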
Alignment is additive
Observation: the score of aligning x1……xM with y1……yN is additive
Say that x1…xi aligns to y1…yj, and xi+1…xM aligns to yj+1…yN
The two scores add up:
F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])
Dynamic Programming
• There are only a polynomial number of subproblems: align x1…xi to y1…yj
• Original problem is one of the subproblems: align x1…xM to y1…yN
• Each subproblem is easily solved from smaller subproblems (we will show this next)
• Then, we can apply Dynamic Programming!!!
Let F(i, j) = optimal score of aligning x1……xi with y1……yj
F is the DP “Matrix” or “Table” (“Memoization”)
Dynamic Programming (cont’d)
Notice three possible cases:
1. xi aligns to yj
   x1……xi-1 xi
   y1……yj-1 yj
   F(i, j) = F(i – 1, j – 1) + m, if xi = yj; F(i – 1, j – 1) – s, if not
2. xi aligns to a gap
   x1……xi-1 xi
   y1……yj   -
   F(i, j) = F(i – 1, j) – d
3. yj aligns to a gap
   x1……xi   -
   y1……yj-1 yj
   F(i, j) = F(i, j – 1) – d
Dynamic Programming (cont’d)
How do we know which case is correct?
Inductive assumption: F(i, j – 1), F(i – 1, j), F(i – 1, j – 1) are optimal
Then,
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$
where s(xi, yj) = m, if xi = yj; -s, if not
Example
x = AGTA, y = ATA
m = 1 (match score), s = 1 (mismatch penalty), d = 1 (gap penalty)
F(i, j)   i = 0    1    2    3    4
                   A    G    T    A
j = 0          0   -1   -2   -3   -4
j = 1   A     -1    1    0   -1   -2
j = 2   T     -2    0    0    1    0
j = 3   A     -3   -1   -1    0    2
F(1, 1) = max{F(0, 0) + s(A, A), F(0, 1) – d, F(1, 0) – d} = max{0 + 1, -1 – 1, -1 – 1} = 1
Optimal alignment (score F(4, 3) = 2):
AGTA
A-TA
Procedure to output Alignment
• Follow the backpointers
• When diagonal, OUTPUT xi, yj
• When up, OUTPUT yj
• When left, OUTPUT xi
The Needleman-Wunsch Matrix
[Figure: DP matrix with x1…xM along the horizontal axis and y1…yN along the vertical axis]
Every nondecreasing path from (0, 0) to (M, N) corresponds to an alignment of the two sequences
An optimal alignment is composed of optimal subalignments
The Needleman-Wunsch Algorithm
1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = - j d
   c. F(i, 0) = - i d
2. Main Iteration. Filling in partial alignments
   a. For each i = 1……M
         For each j = 1……N
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) & \text{[case 1]} \\ F(i-1, j) - d & \text{[case 2]} \\ F(i, j-1) - d & \text{[case 3]} \end{cases}$$
$$Ptr(i, j) = \begin{cases} \text{DIAG}, & \text{if [case 1]} \\ \text{LEFT}, & \text{if [case 2]} \\ \text{UP}, & \text{if [case 3]} \end{cases}$$
3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment
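The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `needleman_wunsch` and the defaults m = s = d = 1 are choices made here, and ties between cases are broken in favour of DIAG, then LEFT, then UP.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment: returns (optimal score, aligned x row, aligned y row)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # 1. Initialization: leading gaps cost d each
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"
    # 2. Main iteration: best of the three cases
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            options = [
                (F[i - 1][j - 1] + sub, "DIAG"),   # case 1: xi aligns to yj
                (F[i - 1][j] - d, "LEFT"),         # case 2: xi aligns to a gap
                (F[i][j - 1] - d, "UP"),           # case 3: yj aligns to a gap
            ]
            F[i][j], ptr[i][j] = max(options, key=lambda t: t[0])
    # 3. Termination: follow the backpointers from (M, N)
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == "LEFT":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))   # (2, 'AGTA', 'A-TA')
```

On the small example x = AGTA, y = ATA this reproduces the score 2 found in the bottom-right cell of the example table.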
Performance
• Time: O(NM)
• Space: O(NM)
• Later we will cover more efficient methods
A variant of the basic algorithm:
• Maybe it is OK to have an unlimited # of gaps in the beginning and end:
----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
          |||||||  |||| |  || ||
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
• Then, we don’t want to penalize gaps in the ends
Different types of overlaps
Example: 2 overlapping “reads” from a sequencing project (recall Lecture 1)
Example: Search for a mouse gene within a human chromosome
The Overlap Detection variant
Changes:
1. Initialization
   For all i, j,
   F(i, 0) = 0
   F(0, j) = 0
2. Termination
$$F_{OPT} = \max \begin{cases} \max_i F(i, N) \\ \max_j F(M, j) \end{cases}$$
[Figure: DP matrix with x1…xM along the horizontal axis and y1…yN along the vertical axis]
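A sketch of the overlap detection variant, assuming the same match/mismatch/gap scoring as before (m = s = d = 1 by default). The function name `overlap_score` is hypothetical. Only the initialization and the termination differ from the global algorithm: leading gaps are free, and the best score is read off the last row or last column.

```python
def overlap_score(x, y, m=1, s=1, d=1):
    """Best alignment score when gaps at the beginning and end are not penalized."""
    M, N = len(x), len(y)
    # Initialization: F(i, 0) = F(0, j) = 0, so leading gaps are free
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Termination: trailing gaps are free, so take the best cell on the far edges
    best_y_consumed = max(F[i][N] for i in range(M + 1))
    best_x_consumed = max(F[M][j] for j in range(N + 1))
    return max(best_y_consumed, best_x_consumed)


# The end of the first read overlaps the start of the second by "CGT"
print(overlap_score("AAAACGT", "CGTTTTT"))   # 3
```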
The local alignment problem
Given two strings x = x1……xM,
y = y1……yN
Find substrings x’, y’ whose similarity (optimal global alignment value)is maximum
x = aaaacccccggggtta
y = ttcccgggaaccaacc
Why local alignment – examples
• Genes are shuffled between genomes
• Portions of proteins (domains) are often conserved
Cross-species genome similarity
• 98% of genes are conserved between any two mammals
• >70% average similarity in protein sequence
hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001
mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001
rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938
fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174
hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001
mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001
rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938
fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174
hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001
mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001
rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938
fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174
hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001
mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001
rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938
fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174
“atoh” enhancer in human, mouse, rat, fugu fish
The Smith-Waterman algorithm
Idea: Ignore badly aligning regions
Modifications to Needleman-Wunsch:
Initialization: F(0, j) = F(i, 0) = 0
Iteration:
$$F(i, j) = \max \begin{cases} 0 \\ F(i-1, j) - d \\ F(i, j-1) - d \\ F(i-1, j-1) + s(x_i, y_j) \end{cases}$$
The Smith-Waterman algorithm
Termination:
1. If we want the best local alignment…
$F_{OPT} = \max_{i,j} F(i, j)$
Find FOPT and trace back
2. If we want all local alignments scoring > t
?? For all i, j find F(i, j) > t, and trace back?
Complicated by overlapping local alignments
Waterman–Eggert ’87: find all non-overlapping local alignments with minimal recalculation of the DP matrix
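A minimal sketch of the best-local-alignment case (termination option 1), assuming m = s = d = 1 by default. The function name `smith_waterman` is hypothetical. The traceback recomputes which case produced each cell instead of storing pointers, and stops at the first cell with score 0.

```python
def smith_waterman(x, y, m=1, s=1, d=1):
    """Best local alignment: returns (score, aligned piece of x, aligned piece of y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]      # F(0, j) = F(i, 0) = 0
    best, bi, bj = 0, 0, 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            # the extra 0 lets an alignment restart after a badly aligning region
            F[i][j] = max(0, F[i - 1][j - 1] + sub, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Trace back from the best cell until the score drops to 0
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        sub = m if x[i - 1] == y[j - 1] else -s
        if F[i][j] == F[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("aaaacccccggggtta", "ttcccgggaaccaacc"))   # (6, 'cccggg', 'cccggg')
```

For the two example strings x and y from the local alignment problem, the best-scoring pair of substrings under these assumed parameters is cccggg in both.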
Scoring the gaps more accurately
Current model:
Gap of length n incurs penalty n × d
However, gaps usually occur in bunches
Convex gap penalty function:
γ(n): for all n, γ(n + 1) - γ(n) ≤ γ(n) - γ(n – 1)
[Figure: plot of the gap penalty γ(n) against gap length n]
Convex gap dynamic programming
Initialization: same
Iteration:
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ \max_{k = 0 \ldots i-1} F(k, j) - \gamma(i - k) \\ \max_{k = 0 \ldots j-1} F(i, k) - \gamma(j - k) \end{cases}$$
Termination: same
Running Time: O(N^2 M) (assume N > M)
Space: O(NM)
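The recurrence with an arbitrary gap penalty function can be sketched as below. The function name `general_gap_score` and the parameter name `gamma` are hypothetical, and the boundary cells are assumed to cost one gap of the full length, F(i, 0) = -gamma(i) and F(0, j) = -gamma(j). The two inner loops over k are what raise the running time above O(NM).

```python
def general_gap_score(x, y, gamma, m=1, s=1):
    """Optimal global score when a gap of length n costs gamma(n)."""
    M, N = len(x), len(y)
    F = [[0.0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -gamma(i)              # assumed boundary: one gap of length i
    for j in range(1, N + 1):
        F[0][j] = -gamma(j)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            best = F[i - 1][j - 1] + sub
            for k in range(i):           # gap covering x[k+1..i]
                best = max(best, F[k][j] - gamma(i - k))
            for k in range(j):           # gap covering y[k+1..j]
                best = max(best, F[i][k] - gamma(j - k))
            F[i][j] = best
    return F[M][N]


# With a linear penalty this matches the basic algorithm on the small example
print(general_gap_score("AGTA", "ATA", lambda n: n))                        # 2.0
# A penalty that grows slowly with length favours one long gap
print(general_gap_score("AAAGGGTTT", "AAATTT", lambda n: 2 + 0.5 * (n - 1)))  # 3.0
```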
Compromise: affine gaps
γ(n) = d + (n – 1) × e, where d is the gap open penalty and e is the gap extend penalty
To compute optimal alignment,
At position i, j, need to “remember”
   best score if gap is open
   best score if gap is not open
F(i, j): score of alignment x1…xi to y1…yj if xi aligns to yj
G(i, j): score if xi aligns to a gap after yj
H(i, j): score if yj aligns to a gap after xi
V(i, j) = best score of alignment x1…xi to y1…yj
[Figure: plot of the affine gap penalty γ(n), with d and e marked]
Needleman-Wunsch with affine gaps
Why do we need matrices F, G, H?
• xi aligns to yj
   x1……xi-1 xi xi+1
   y1……yj-1 yj  -
   Add -d: G(i+1, j) = F(i, j) – d
• xi aligns to a gap after yj
   x1……xi-1 xi xi+1
   y1……yj   …-  -
   Add -e: G(i+1, j) = G(i, j) – e
Because, perhaps
G(i, j) < V(i, j)
(it is best to align xi to yj if we were aligning only x1…xi to y1…yj and not the rest of x, y),
but on the contrary
G(i, j) – e > V(i, j) – d
(i.e., had we “fixed” our decision that xi aligns to yj, we could regret it at the next step when aligning x1…xi+1 to y1…yj)
Needleman-Wunsch with affine gaps
Initialization:
   V(i, 0) = –(d + (i – 1) × e)
   V(0, j) = –(d + (j – 1) × e)
Iteration:
$$V(i, j) = \max \{ F(i, j),\ G(i, j),\ H(i, j) \}$$
$$F(i, j) = V(i-1, j-1) + s(x_i, y_j)$$
$$G(i, j) = \max \{ V(i-1, j) - d,\ G(i-1, j) - e \}$$
$$H(i, j) = \max \{ V(i, j-1) - d,\ H(i, j-1) - e \}$$
Termination: V(M, N) has the best alignment score
Time?
Space?
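A sketch of the affine-gap recurrences in Python, under the assumption that scores are penalties subtracted from zero, so the boundary values are negative: V(i, 0) = -(d + (i - 1)e) and V(0, j) = -(d + (j - 1)e). The function name `affine_gap_score` and the default parameter values are hypothetical. Cells of G and H that cannot end in the corresponding kind of gap are set to minus infinity.

```python
import math


def affine_gap_score(x, y, m=1.0, s=1.0, d=2.0, e=0.5):
    """Optimal global score with gap penalty gamma(n) = d + (n - 1) * e."""
    M, N = len(x), len(y)
    NEG = -math.inf
    V = [[NEG] * (N + 1) for _ in range(M + 1)]   # best score overall
    G = [[NEG] * (N + 1) for _ in range(M + 1)]   # xi aligned to a gap
    H = [[NEG] * (N + 1) for _ in range(M + 1)]   # yj aligned to a gap
    V[0][0] = 0.0
    for i in range(1, M + 1):
        G[i][0] = V[i][0] = -(d + (i - 1) * e)
    for j in range(1, N + 1):
        H[0][j] = V[0][j] = -(d + (j - 1) * e)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F = V[i - 1][j - 1] + sub                        # xi aligned to yj
            G[i][j] = max(V[i - 1][j] - d, G[i - 1][j] - e)  # open or extend
            H[i][j] = max(V[i][j - 1] - d, H[i][j - 1] - e)
            V[i][j] = max(F, G[i][j], H[i][j])
    return V[M][N]


print(affine_gap_score("AAAGGGTTT", "AAATTT"))            # 3.0
print(affine_gap_score("AGTA", "ATA", d=1.0, e=1.0))      # 2.0
```

Each cell is filled in constant time from three tables, which suggests an answer to the two questions above: both time and space are proportional to MN in this sketch.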
To generalize a bit…
… think of how you would compute the optimal alignment with this gap function γ(n) in time O(MN)
[Figure: plot of the gap function γ(n)]
Bounded Dynamic Programming
Assume we know that x and y are very similar
Assumption: # gaps(x, y) < k(N)
Then, xi aligned to yj implies | i – j | < k(N)
We can align x and y more efficiently:
Time, Space: O(N × k(N)) << O(N^2)
Bounded Dynamic Programming
Initialization:
F(i, 0), F(0, j) undefined for i, j > k(N)
Iteration:
For i = 1…M
   For j = max(1, i – k(N))…min(N, i + k(N))
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i, j-1) - d, & \text{if } j > i - k(N) \\ F(i-1, j) - d, & \text{if } j < i + k(N) \end{cases}$$
Termination: same
Easy to extend to the affine gap case
[Figure: DP matrix with x1…xM along the horizontal axis and y1…yN along the vertical axis; only a diagonal band of width k(N) is filled in]
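The bounded iteration can be sketched with a dictionary that stores only the cells inside the band, so memory grows with the band rather than with the full matrix. The function name `banded_alignment_score` is hypothetical, the band is taken as |i - j| <= k to match the loop bounds, and cells outside the band are treated as minus infinity.

```python
def banded_alignment_score(x, y, k, m=1, s=1, d=1):
    """Global alignment score restricted to cells with |i - j| <= k."""
    M, N = len(x), len(y)
    NEG = float("-inf")
    F = {(0, 0): 0}                          # only in-band cells are stored
    for i in range(1, min(M, k) + 1):
        F[(i, 0)] = -i * d
    for j in range(1, min(N, k) + 1):
        F[(0, j)] = -j * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[(i, j)] = max(
                F.get((i - 1, j - 1), NEG) + sub,
                F.get((i, j - 1), NEG) - d,  # missing when (i, j-1) is outside the band
                F.get((i - 1, j), NEG) - d,  # missing when (i-1, j) is outside the band
            )
    return F.get((M, N), NEG)                # -inf if (M, N) is outside the band


print(banded_alignment_score("AGTA", "ATA", k=1))   # 2
print(banded_alignment_score("AGTA", "ATA", k=0))   # -inf: lengths differ by 1
```

The result equals the unrestricted optimum only when some optimal path stays inside the band, which is the assumption the slides make about x and y being very similar.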
Linear-Space Alignment
Subsequences and Substrings
Definition
A string x’ is a substring of a string x, if x = ux’v for some prefix string u and suffix string v
(similarly, x’ = xi…xj, for some 1 ≤ i ≤ j ≤ |x|)
A string x’ is a subsequence of a string x if x’ can be obtained from x by deleting 0 or more letters
(x’ = xi1…xik, for some 1 ≤ i1 < … < ik ≤ |x|)
Note: a substring is always a subsequence
Example:
x = abracadabra
y = cadabr; substring
z = brcdbr; subsequence, not substring
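The two definitions translate directly into short checks. The function names below are hypothetical; the subsequence test scans x once from left to right, consuming letters until each letter of the candidate has been found in order.

```python
def is_substring(candidate, x):
    """True if candidate occurs in x as a contiguous block."""
    return candidate in x


def is_subsequence(candidate, x):
    """True if candidate can be obtained from x by deleting 0 or more letters."""
    remaining = iter(x)
    # 'ch in remaining' advances the iterator past the first occurrence of ch
    return all(ch in remaining for ch in candidate)


x = "abracadabra"
print(is_substring("cadabr", x), is_subsequence("cadabr", x))   # True True
print(is_substring("brcdbr", x), is_subsequence("brcdbr", x))   # False True
```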
Hirschberg’s algorithm
Given a set of strings x, y,…, a common subsequence is a string u that is a subsequence of all strings x, y, …
• Longest common subsequence
   Given strings x = x1 x2 … xM, y = y1 y2 … yN,
   Find longest common subsequence u = u1 … uk
• Algorithm:
$$F(i, j) = \max \begin{cases} F(i-1, j) \\ F(i, j-1) \\ F(i-1, j-1) + [1 \text{ if } x_i = y_j;\ 0 \text{ otherwise}] \end{cases}$$
• Ptr(i, j) = (same as in N-W)
• Termination: trace back from Ptr(M, N), and prepend a letter to u whenever Ptr(i, j) = DIAG and F(i – 1, j – 1) < F(i, j)
• Hirschberg’s original algorithm solves this in linear space
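A quadratic-space sketch of the longest common subsequence recurrence and its traceback. The function name `lcs` is hypothetical, and this is the plain table-filling version rather than Hirschberg's linear-space method. Instead of storing pointers, the traceback checks which neighbour could have produced each cell.

```python
def lcs(x, y):
    """Return one longest common subsequence of x and y."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            diag = F[i - 1][j - 1] + (1 if x[i - 1] == y[j - 1] else 0)
            F[i][j] = max(F[i - 1][j], F[i][j - 1], diag)
    # Trace back from (M, N), collecting a letter on every scoring diagonal step
    u = []
    i, j = M, N
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1] and F[i][j] == F[i - 1][j - 1] + 1:
            u.append(x[i - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(u))


print(lcs("abracadabra", "brcdbr"))   # brcdbr
print(lcs("AGTA", "ATA"))             # ATA
```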
Introduction: Compute optimal score
It is easy to compute F(M, N) in linear space
Allocate( column[1] )
Allocate( column[2] )
For i = 1….M
   If i > 1, then:
      Free( column[ i – 2 ] )
      Allocate( column[ i ] )
   For j = 1…N
      F(i, j) = …
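In Python the same idea can be sketched by keeping just two lists, the previous column and the current one. The function name `nw_score_linear_space` and the defaults m = s = d = 1 are assumptions. Only the score is returned, since the backpointers needed for the alignment itself are discarded along with the old columns.

```python
def nw_score_linear_space(x, y, m=1, s=1, d=1):
    """Optimal global alignment score F(M, N) using two columns of memory."""
    N = len(y)
    prev = [-j * d for j in range(N + 1)]        # column i = 0
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * N                 # F(i, 0) = -i d
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev[j - 1] + sub,      # F(i-1, j-1) + s(xi, yj)
                         prev[j] - d,            # F(i-1, j) - d
                         cur[j - 1] - d)         # F(i, j-1) - d
        prev = cur                               # the older column is dropped
    return prev[N]


print(nw_score_linear_space("AGTA", "ATA"))   # 2
```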
Linear-space alignment
To compute both the optimal score and the optimal alignment:
Divide & Conquer approach:
Notation:
xr, yr: reverse of x, y
E.g. x = accgg; xr = ggcca
Fr(i, j): optimal score of aligning $x^r_1 \ldots x^r_i$ and $y^r_1 \ldots y^r_j$,
   same as aligning $x_{M-i+1} \ldots x_M$ and $y_{N-j+1} \ldots y_N$
Linear-space alignment
Lemma: (assume M is even)
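$$F(M, N) = \max_{k = 0 \ldots N} \left( F(M/2, k) + F^r(M/2, N - k) \right)$$
[Figure: the DP matrix split at column M/2 of x; the optimal path crosses at position k* of y, with F(M/2, k) to the left and Fr(M/2, N – k) to the right]
Example:
ACC-GGTGCCCAGGACTG--CAT
ACCAGGTG----GGACTGGGCAG
k* = 8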
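The lemma can be checked numerically with a short sketch. The helper names `last_column` and `check_lemma` are hypothetical, and m = s = d = 1 is assumed, so the maximizing k found here need not coincide with the k* shown in the slide, whose scoring parameters are not stated. For odd M the split is taken at floor(M/2).

```python
def last_column(x, y, m=1, s=1, d=1):
    """Scores F(len(x), k) for k = 0..len(y), computed with two columns of memory."""
    prev = [-k * d for k in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev[j - 1] + sub, prev[j] - d, cur[j - 1] - d)
        prev = cur
    return prev


def check_lemma(x, y):
    """Return (F(M, N), best split score, maximizing k)."""
    N = len(y)
    half = len(x) // 2
    forward = last_column(x[:half], y)                  # F(M/2, k)
    backward = last_column(x[half:][::-1], y[::-1])     # Fr(M/2, N - k)
    sums = [forward[k] + backward[N - k] for k in range(N + 1)]
    k_best = max(range(N + 1), key=lambda k: sums[k])
    return last_column(x, y)[N], sums[k_best], k_best


print(check_lemma("AGTA", "ATA"))   # (2, 2, 1)
```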
Linear-space alignment
• Now, using 2 columns of space, we can compute
for k = 0…N, F(M/2, k), Fr(M/2, N – k)
PLUS the backpointers
[Figure: forward pass over x1…xM/2 and reverse pass over xM/2+1…xM, each against y1…yN]
Linear-space alignment
• Now, we can find k* maximizing F(M/2, k) + Fr(M/2, N-k)
• Also, we can trace the path exiting column M/2 from k*
[Figure: the optimal path leaving column M/2 at position k*; columns labeled 0, 1, …, M/2, M/2+1, …, M, M+1]
Linear-space alignment
• Iterate this procedure to the left and right!
[Figure: the matrix split into two subproblems, of sizes M/2 × k* and M/2 × (N – k*)]
Linear-space alignment
Hirschberg’s Linear-space algorithm:
MEMALIGN(l, l’, r, r’): (aligns xl…xl’ with yr…yr’)
1. Let h = l + ⌈(l’ – l)/2⌉ (the middle column)
2. Find (in Time O((l’ – l) × (r’ – r)), Space O(r’ – r)) the optimal path, Lh, entering column h – 1, exiting column h
   Let k1 = position at column h – 2 where Lh enters
       k2 = position at column h + 1 where Lh exits
3. MEMALIGN(l, h – 2, r, k1)
4. Output Lh
5. MEMALIGN(h + 1, l’, k2, r’)
Top level call: MEMALIGN(1, M, 1, N)
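A compact Python sketch of the divide and conquer idea follows. It uses the common textbook formulation (split x in half, find the best split point of y, recurse on both sides) rather than the exact MEMALIGN bookkeeping with k1 and k2, so treat it as an illustration only. The names `last_column` and `hirschberg` and the defaults m = s = d = 1 are assumptions, and string slicing is used for brevity where an index-based version would avoid copying.

```python
def last_column(x, y, m, s, d):
    """Scores F(len(x), k) for k = 0..len(y), using two columns of memory."""
    prev = [-k * d for k in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev[j - 1] + sub, prev[j] - d, cur[j - 1] - d)
        prev = cur
    return prev


def hirschberg(x, y, m=1, s=1, d=1):
    """Return an optimal global alignment (aligned x row, aligned y row)."""
    if not x:
        return "-" * len(y), y
    if not y:
        return x, "-" * len(x)
    if len(x) == 1:
        # Base case: pair the single letter with its best partner in y, if that pays off
        k = max(range(len(y)), key=lambda t: m if y[t] == x else -s)
        pair = m if y[k] == x else -s
        if pair >= -2 * d:
            return "-" * k + x + "-" * (len(y) - k - 1), y
        return x + "-" * len(y), "-" + y
    half = len(x) // 2
    forward = last_column(x[:half], y, m, s, d)
    backward = last_column(x[half:][::-1], y[::-1], m, s, d)
    # Best position in y at which the optimal path crosses the middle of x
    k_star = max(range(len(y) + 1), key=lambda k: forward[k] + backward[len(y) - k])
    left_x, left_y = hirschberg(x[:half], y[:k_star], m, s, d)
    right_x, right_y = hirschberg(x[half:], y[k_star:], m, s, d)
    return left_x + right_x, left_y + right_y


print(hirschberg("AGTA", "ATA"))   # ('AGTA', 'A-TA')
```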
Linear-space alignment
Time, Space analysis of Hirschberg’s algorithm:
To compute optimal path at middle column, for box of size M × N,
   Space: 2N
   Time: cMN, for some constant c
Then, left, right calls cost c( M/2 × k* + M/2 × (N – k*) ) = cMN/2
All recursive calls cost
   Total Time: cMN + cMN/2 + cMN/4 + ….. = 2cMN = O(MN)
   Total Space: O(N) for computation, O(N + M) to store the optimal alignment
Heuristic Local Aligners
1. The basic indexing & extension technique
2. Indexing: techniques to improve sensitivity
   Pairs of Words, Patterns
3. Systems for local alignment
Indexing-based local alignment
Dictionary: all words of length k (~10)
Alignment initiated between words of alignment score ≥ T (typically T = k)
Alignment: ungapped extensions until score below statistical threshold
Output: all local alignments with score > statistical threshold
[Figure: the words of the query are indexed, then the DB is scanned against the index]
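The index-then-extend idea can be sketched as follows. This is a toy illustration, not how any particular tool implements it: the function names `find_seeds` and `extend_ungapped`, the exact-match seeding (the T = k case), and the simple drop-off rule standing in for the statistical threshold are all assumptions made for the example.

```python
from collections import defaultdict


def find_seeds(query, db, k):
    """Return (query_pos, db_pos) pairs where the same k-letter word occurs in both."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)          # dictionary of query words
    seeds = []
    for j in range(len(db) - k + 1):             # scan the database once
        for i in index.get(db[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend_ungapped(query, db, i, j, k, m=1, s=1, drop=3):
    """Extend a seed in both directions without gaps; stop once the score falls `drop` below its best."""
    def walk(qi, dj, step):
        score = best = best_len = n = 0
        while 0 <= qi < len(query) and 0 <= dj < len(db):
            score += m if query[qi] == db[dj] else -s
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score >= drop:
                break
            qi += step
            dj += step
        return best, best_len

    right_score, right_len = walk(i + k, j + k, +1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    total = k * m + left_score + right_score
    return total, (i - left_len, i + k + right_len), (j - left_len, j + k + right_len)


query, db = "GTAAGGTCCAGT", "TTGTTAGGTCAGTT"
print(find_seeds(query, db, 4))                  # [(3, 5), (4, 6), (8, 9)]
print(extend_ungapped(query, db, 3, 5, 4))       # (6, (0, 8), (2, 10))
```

In the usage example the first seed grows into the ungapped match of GTAAGGTC against GTTAGGTC (7 matches, 1 mismatch); joining it to the later CAGT seed would require a gap.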
Indexing-based local alignment—Extensions
[Figure: dot matrix of the query A C G A A G T A A G G T C C A G T against a DB sequence, with a word hit extended along the diagonal]
Gapped extensions until threshold
• Extensions with gaps until score < C below best score so far
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT
Sensitivity-Speed Tradeoff
long words (k = 15): faster, less sensitive
short words (k = 7): more sensitive, slower
Kent WJ, Genome Research 2002
[Figure: sensitivity vs. speed]
Sensitivity-Speed Tradeoff
Methods to improve sensitivity/speed
1. Using pairs of words
2. Using inexact words
3. Patterns—non consecutive positions
……ATAACGGACGACTGATTACACTGATTCTTAC……
……GGCACGGACCAGTGACTACTCTGATTCCCAG……
……ATAACGGACGACTGATTACACTGATTCTTAC……
……GGCGCCGACGAGTGATTACACAGATTGCCAG……
TTTGATTACACAGAT T G TT CAC G
Measured improvement
Kent WJ, Genome Research 2002
Non-consecutive words—Patterns
Patterns increase the likelihood of at least one match within a long conserved region
[Figure: word matches using consecutive positions vs. non-consecutive positions; panels labeled 3, 5, 6, and 7 common]
On a 100-long 70% conserved region:
                          Consecutive   Non-consecutive
Expected # hits:          1.07          0.97
Prob[at least one hit]:   0.30          0.47
Advantage of Patterns
[Figure: three patterns, spanning 11, 11, and 10 positions]
Multiple patterns
• K patterns
   Takes K times longer to scan
   Patterns can complement one another
• Computational problem:
   Given: a model (prob distribution) for homology between two regions
   Find: best set of K patterns that maximizes Prob(at least one match)
TTTGATTACACAGAT T G TT CAC G T G T C CAG TTGATT A G
Buhler et al. RECOMB 2003
Sun & Buhler RECOMB 2004
How long does it take to search the query?
Variants of BLAST
• NCBI BLAST: search the universe
   http://www.ncbi.nlm.nih.gov/BLAST/
• MEGABLAST: http://genopole.toulouse.inra.fr/blast/megablast.html
   Optimized to align very similar sequences
   • Works best when k = 4i ≥ 16
   • Linear gap penalty
• WU-BLAST: (Wash U BLAST) http://blast.wustl.edu/
   Very good optimizations
   Good set of features & command line arguments
• BLAT http://genome.ucsc.edu/cgi-bin/hgBlat
   Faster, less sensitive than BLAST
   Good for aligning huge numbers of queries
• CHAOS http://www.cs.berkeley.edu/~brudno/chaos
   Uses inexact k-mers, sensitive
• PatternHunter http://www.bioinformaticssolutions.com/products/ph/index.php
   Uses patterns instead of k-mers
• BlastZ http://www.psc.edu/general/software/packages/blastz/
   Uses patterns, good for finding genes
• Typhon http://typhon.stanford.edu
   Uses multiple alignments to improve sensitivity/speed tradeoff