week 3b - david r. cheriton school of computer sciencebrowndg/482s14/notes/week3b.pdf ·...
1 CS 482/682, Spring 2014, Week 3B
Week 3B
Topics for this lecture:
• Local alignment algorithms
• Heuristic methods to make alignment better
Two big ideas for this lecture:
• Local alignments must be computed heuristically to avoid hideous runtimes.
• There is some pretty great computer science hiding under the hood of alignment algorithms.
New topic: local alignment
Typically, when people do alignment, they’re actually finding good local alignments.
Given: two sequences S and T.
Find: subregions of S and T for which there’s enough sequence similarity that they’re likely to have come from the homologous model.
(That is, find subregions with greatest score.)
This is done by exactly the same sort of procedure as before, except we add only one change.
Local alignment recursion
Think about it by dynamic programming:
The best local alignment that ends at s_i and t_j is one of:
• The empty alignment
• The best alignment ending at s_{i-1} and t_j, followed by s_i aligned to a gap
• The best alignment ending at s_i and t_{j-1}, followed by t_j aligned to a gap
• The best alignment ending at s_{i-1} and t_{j-1}, followed by s_i aligned to t_j
The score of the empty alignment is zero (because it’s the logarithm of 1 over 1). [Or because we don’t want to deal with zero divided by zero…]
Local alignment, cont’d
With linear gap penalties, then (where g is the score of aligning one letter to a gap):
M(i,j) = max { 0, M(i-1,j-1) + s(s_i,t_j), M(i-1,j) + g, M(i,j-1) + g }
And we compute the matrix of “score of the best local alignment ending at s_i and t_j” this way.
We want the best possible local alignment.
That’s the one with the highest score in the entire matrix.
When extended to affine gaps, this is the classic Smith-Waterman algorithm.
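As a minimal sketch of the recurrence above, assuming a simple match/mismatch score in place of s(s_i,t_j) and a linear gap score g (the function name and the default values +1, -1, -2 are illustrative choices, not from the lecture):

```python
def local_alignment_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """Fill M, where M[i][j] is the best score of a local alignment
    ending at s[i-1] and t[j-1] (the slides' s_i and t_j)."""
    n, m = len(s), len(t)
    # Row 0 and column 0 stay at 0: only the empty alignment ends there.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(
                0,                      # the empty alignment
                M[i - 1][j - 1] + sub,  # s_i aligned to t_j
                M[i - 1][j] + gap,      # s_i aligned to a gap
                M[i][j - 1] + gap,      # t_j aligned to a gap
            )
    return M


M = local_alignment_matrix("TTACGTAA", "GGACGTCC")
best = max(max(row) for row in M)  # highest score in the entire matrix
print(best)
```

With these assumed scores, the shared substring ACGT is the best local alignment of the two example strings.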
A quick 1-slide catch-up
We want to find local alignments.
We build the matrix that consists of the score of the best local alignment (which might be the empty one) that ends with s_i and t_j.
That works by traditional dynamic programming.
To incorporate more complicated gap penalties, we may need extra matrices, as for global alignment.
After O(nm) time, we have the matrix. Find the highest entry, and backtrack until the score is zero.
That’s the optimal local alignment.
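The whole procedure on this slide can be sketched end to end: fill the matrix, remember the highest entry, and backtrack until the score is zero. This is an illustration with linear gaps only; the function name and the scores (+1, -1, -2) are assumptions, and the affine-gap version would need the extra matrices mentioned above.

```python
def local_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one optimal local alignment."""
    n, m = len(s), len(t)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(0, M[i - 1][j - 1] + sub,
                          M[i - 1][j] + gap, M[i][j - 1] + gap)
            if M[i][j] > best:          # track the highest entry
                best, bi, bj = M[i][j], i, j

    # Backtrack from the highest entry until the score is zero.
    a, b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and M[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(local_align("TTACGTAA", "GGACGTCC"))
```

Both the fill and the backtrack stay within the O(nm) bound, since the backtrack visits at most n+m cells.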
An important side note
It’s very straightforward to not just find one local alignment of S and T, but many of them.
This is crucially important when several subregions of S and T are evolutionarily conserved.
One key example, which we will come to later in the term, is gene finding: the coding parts of genes are conserved, and the other parts, not so much.
But for now, think about that O(nm) runtime.
Is this runtime good enough?
No!
GenBank is 10^11 letters long. (Well, it was in 2007…)
To fill in the DP matrix takes 10^22 time. That’s no good.
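To get a feel for that number, here is a rough back-of-the-envelope calculation. The rate of 10^9 matrix cells per second is an assumption made purely for illustration; it is not a figure from the lecture.

```python
genbank_letters = 10**11            # size quoted on the slide (2007)
cells = genbank_letters ** 2        # DP matrix cells for an all-against-all comparison
cells_per_second = 10**9            # ASSUMED throughput, for illustration only

seconds = cells / cells_per_second
years = seconds / (365 * 24 * 3600)
print(f"{cells:.0e} cells, roughly {years:,.0f} years at the assumed rate")
```

Under that assumption the matrix fill alone would take on the order of hundreds of thousands of years.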
We must have a shortcut. Can we change the question?
Local alignments must be computed heuristically to avoid hideous runtimes
What does that mean?
Heuristic algorithms
Heuristic algorithms are basically “algorithms for the real world”.
They acknowledge that the real world is complicated, and you can’t always get what you want.
A heuristic algorithm promises to probably run fast, and to probably solve the problem pretty close to right.
This is much vaguer than a theorem can guarantee.
Heuristics in sequence alignment
Pairwise sequence alignment as we defined it may be kind of stupid:
Why do we build an entire DP table if much of it is a total waste?
Let’s consider a somewhat different problem:
Given: Two sequences S and T.
Find: Some good local alignments of S to T.
We only want to find good alignments. We also might not find them all.
Seeding local alignments
The most important approach, popularized in BLAST:
Most often, a good local alignment will include a region of some length that is perfectly conserved.
(Example: in the best alignment between arginase from staph and arginase from human, there is an 8-amino acid sequence common to both, LVLGGDHS.)
So why not start by assuming we’ll find these “seeds” or “hits”, and then build the local alignment from there?
Easiest to explain for DNA
This is much easier to explain for DNA than for protein, so we’ll start with that.
But the idea spreads to protein as well.
Most of the interesting research in this area in the early 2000s has been done here, at UW.
Yes, really.
Seeded alignment
1) Start with all matches that are k letters in length (assume only going forward)
2) For each of those matches, extend up and to the left, and down and to the right, until we reach a place where the DP matrix is 0.
3) Keep regions that have high score.
(We’ll do an example on the board.)
Why do this?
1) Runtime goes down substantially
2) We probably don’t miss an optimal alignment.
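Step 2 can be sketched in a simplified form. The version below is an ungapped simplification: it only walks along the seed's diagonal, rather than through the full DP matrix, and stops in each direction when the running score drops to zero. The function name and the scores (+1 match, -1 mismatch) are assumptions for illustration.

```python
def extend_seed(s, t, i, j, k, match=1, mismatch=-1):
    """Ungapped extension of the seed s[i:i+k] == t[j:j+k] along its diagonal.
    Returns (score, s_start, t_start, length) of the best segment found."""

    def walk(score, a, b, step):
        # Walk in one direction from (a, b), starting with the score so far.
        # Stop when the running score drops to zero or below, or a sequence ends.
        best, best_len, run, length = score, 0, score, 0
        while 0 <= a < len(s) and 0 <= b < len(t):
            run += match if s[a] == t[b] else mismatch
            length += 1
            if run <= 0:
                break
            if run > best:              # remember the best-scoring stopping point
                best, best_len = run, length
            a += step
            b += step
        return best, best_len

    score = k * match                                   # the seed itself
    score, right = walk(score, i + k, j + k, +1)        # down and to the right
    score, left = walk(score, i - 1, j - 1, -1)         # up and to the left
    return score, i - left, j - left, left + k + right


s = "GGACGTTGCAGG"
t = "CCACGTAGCACC"
print(extend_seed(s, t, 2, 2, 4))   # seed ACGT at position 2 in both
```

Step 3 would then keep only the extended regions whose score clears some threshold.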
BLASTN
BLASTN matches nucleotide sequences to nucleotide sequences.
It is based on 11-base-long seeds, which must match exactly.
(Vocabulary word: k-mer = a k-base long sequence of DNA. From “polymer.”)
What if we change the seed size, by reducing it by 1?
A 10-base-long sequence occurs once per 1 Mb, if DNA is noise (hah!).
Every time we drop the seed size by 1, we find 4 times as many seeds.
More detail
We start at places where S and T exactly match for k letters.
The expected number of those matches: roughly nm 4^(-k). Why? Well, there are (n-k+1)(m-k+1) places, which is very close to nm. And each has probability 4^(-k) of being a match.
So the expected number of matches is close to nm 4^(-k).
(You’ll see this in more detail on your homework.)
If n = 100,000, m = 100,000, k = 11, expect around 2400 seeds to be found (10^10 / 4^11 ≈ 2384).
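The estimate above is easy to check numerically. This is a small sketch under the slide's assumption that both sequences are uniform random DNA; the function name and the alphabet_size parameter are illustrative.

```python
def expected_seeds(n, m, k, alphabet_size=4):
    """Expected number of exact k-letter matches between two random sequences
    of lengths n and m: (n-k+1)(m-k+1) places, each matching with
    probability alphabet_size^(-k)."""
    return (n - k + 1) * (m - k + 1) / alphabet_size ** k


print(expected_seeds(100_000, 100_000, 11))
# Dropping the seed size by 1 gives about 4 times as many seeds.
print(expected_seeds(100_000, 100_000, 10) / expected_seeds(100_000, 100_000, 11))
```

For n = m = 100,000 and k = 11 this comes out a little under 2,400, in line with the rough figure above.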
What do we do with seeds?
Try to build an alignment.
Remember: this is a heuristic algorithm. We don’t need to find everything.
Fast algorithms, like BLAST, quickly determine if a seed is a good seed or a bad seed.
• Quick search in both directions; if most symbols match, it’s a good seed. If most don’t match, it’s bad.
• Build a local alignment around seeds that are chosen.
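One possible reading of the "quick search in both directions" test is sketched below. It is not BLAST's actual criterion: the function name, the window of 10 symbols on each side, and the 50% cutoff for "most symbols match" are all assumptions made for illustration.

```python
def is_good_seed(s, t, i, j, k, window=10, min_identity=0.5):
    """Look `window` symbols to each side of the seed s[i:i+k] == t[j:j+k],
    along the seed's diagonal, and report whether most of them match."""
    matches = total = 0
    offsets = list(range(-window, 0)) + list(range(k, k + window))
    for d in offsets:
        a, b = i + d, j + d
        if 0 <= a < len(s) and 0 <= b < len(t):   # skip positions off either end
            total += 1
            matches += s[a] == t[b]
    return total > 0 and matches / total > min_identity


s = "AAGCTACGTGGATC"
good_t = "CAGCTACGTGGATG"   # flanks of the ACGT seed mostly agree with s
bad_t = "TTTTTACGTCCCCC"    # flanks of the ACGT seed mostly disagree with s
print(is_good_seed(s, good_t, 5, 5, 4, window=5))
print(is_good_seed(s, bad_t, 5, 5, 4, window=5))
```

Only seeds that pass a cheap check like this would go on to the more expensive alignment step.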
How do we find the seeds?
We want:
• Places where S matches T for k letters.
How to do that?
• Simplest approach: hash table of all k-letter substrings of S.
• Look up each k-letter substring of T.
• Matches form seeds.
How do we build the index?
For DNA, we can build a trie (you saw those in 240) of the sequence’s k-mers.
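The simplest approach above, the hash table, can be sketched with a Python dictionary standing in for the index (a trie would serve the same purpose). The function names are illustrative.

```python
from collections import defaultdict


def build_index(s, k):
    """Hash table mapping each k-letter substring of S to its start positions."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i)
    return index


def find_seeds(index, t, k):
    """Look up each k-letter substring of T; every match (i, j) is a seed,
    meaning s[i:i+k] == t[j:j+k]."""
    seeds = []
    for j in range(len(t) - k + 1):
        for i in index.get(t[j:j + k], ()):   # .get avoids adding empty entries
            seeds.append((i, j))
    return seeds


index = build_index("ACGTACGT", 4)
print(find_seeds(index, "TTACGT", 4))
```

Building the index touches each position of S once, and the scan touches each position of T once plus one step per seed reported.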
Overall runtime
Build the index: O(n) time.
Find matches between the index and T: O(m) time to scan T, plus we need to record all of the r hits found: O(m+r).
Extend the matches to find true and false hits:
• Probably a constant amount of work, on average, for each hit
– Most hits are random chance
Overall runtime: O(n+m+r). That’s not bad, if r is small!
BLASTP
With nucleotides, we’re requiring k positions with exact matches.
For proteins, that’s not really reasonable: some amino acids mutate to another one very often.
So BLASTP looks for 3- or 4-letter protein sequences that are “very close” to each other, and then builds matches from them.
(Where “very close” means the total BLOSUM score in the short window is at least +13.)
How to implement that?
With BLASTP:
• Build an automaton that reflects all strings close to short strings in T (the short sequence)
• Scan S (the longer sequence), looking for matches
• Extend matches to alignments.
The extension phase is more complex, too.
This is harder, but actually a lot like CS 360, for those of you who took that.
We won’t do the automata
But here’s an approximation of how to do it:
1) For every possible protein 3-mer from T, find all 3-mers that, if found in S, score at least +13 (or whatever). Build these into a data structure C.
2) Build a hash table H for S of its 3-mers, just like for the nucleotide case
3) For every 3-mer x in T, its matches are all of the entries in the hash table H for the keys found at C(x).
C is a pretty small structure: there are only 8000 3-mers.
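The three steps above can be sketched as follows. To keep the example short it does NOT use BLOSUM: toy_score is a made-up scoring function over a 4-letter toy alphabet (+6 for identity, +2 for a "similar" pair among I, L, V, and -2 otherwise), and all function names are hypothetical. A real run would pass the 20-letter amino acid alphabet and a BLOSUM lookup instead.

```python
from itertools import product


def build_neighborhood(query_kmers, alphabet, score, threshold):
    """Step 1: the structure C. Maps each k-mer x of T to every k-mer y
    whose total score against x is at least `threshold`."""
    C = {}
    for x in set(query_kmers):
        C[x] = ["".join(y) for y in product(alphabet, repeat=len(x))
                if sum(score(a, b) for a, b in zip(x, y)) >= threshold]
    return C


def neighborhood_seeds(s, t, k, alphabet, score, threshold):
    # Step 2: hash table H of the k-mers of S, as in the nucleotide case.
    H = {}
    for i in range(len(s) - k + 1):
        H.setdefault(s[i:i + k], []).append(i)
    # Step 3: for each k-mer x of T, collect H's entries for the keys in C[x].
    t_kmers = [t[j:j + k] for j in range(len(t) - k + 1)]
    C = build_neighborhood(t_kmers, alphabet, score, threshold)
    return [(i, j) for j, x in enumerate(t_kmers)
            for y in C[x] for i in H.get(y, ())]


# Toy scoring, an ASSUMPTION for illustration only (not BLOSUM).
SIMILAR = {frozenset("IL"), frozenset("IV"), frozenset("LV")}

def toy_score(a, b):
    if a == b:
        return 6
    return 2 if frozenset((a, b)) in SIMILAR else -2


print(neighborhood_seeds("GGIVLGG", "LVL", 3, "GILV", toy_score, 13))
```

Here the 3-mer IVL in S is not identical to the query 3-mer LVL, but it scores 2 + 6 + 6 = 14 under the toy scoring, so it still produces a seed.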
Which sequence to index?
That’s actually a tough question.
Here’s a typical scenario:
• S is the human genome (length n)
• T_1 is a short protein sequence (length m_1)
• T_2 is another short sequence (length m_2)
If we’re smart, build an index for S, once, and then look up the short sequences in it.
Added time for T_2 is more like O(m_2), not O(n+m_2).
More on indexing
But memory is a concern:
• Indexing the human genome is expensive!
• Oh, wait. No, it isn’t, not anymore… you probably should index the longer sequence.
BLASTN (1990) indexes the query, not the database.
BLAT (2000) indexes the database, not the query.
BLASTP also indexes the query.
• Automaton is bigger than sequence
• Protein databases might be pretty big.
Extensions to this idea
Two-hit BLAST:
• Require two seeds (probably shorter) that are nearer than k from each other, and base the alignment on their enclosing box.
• Potentially even fewer false positives, but one has to use shorter seeds. There’s quite a tradeoff here; see your next homework.
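One simple reading of the two-hit requirement is sketched below. It assumes that both hits must lie on the same diagonal and must not overlap, which is narrower than the "enclosing box" description above; the function name and the window parameter (the maximum distance between the two hits) are hypothetical.

```python
from collections import defaultdict


def two_hit_pairs(seeds, k, window):
    """seeds: (i, j) start positions of k-letter hits between S and T.
    Returns pairs of consecutive hits on the same diagonal that do not
    overlap and whose start positions are less than `window` apart."""
    by_diag = defaultdict(list)
    for i, j in seeds:
        by_diag[i - j].append(i)          # hits with equal i - j share a diagonal

    pairs = []
    for diag, starts in by_diag.items():
        starts.sort()
        for a, b in zip(starts, starts[1:]):   # only neighbouring hits are compared
            if k <= b - a < window:
                pairs.append(((a, a - diag), (b, b - diag)))
    return pairs


seeds = [(0, 2), (5, 7), (30, 32), (4, 0)]
print(two_hit_pairs(seeds, k=3, window=10))
```

In this example only the first two hits qualify: the third is on the same diagonal but too far away, and the fourth is alone on its diagonal.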
Wrap-up
• Local alignment
– Not a full alignment of all letters of S with all letters of T.
– Just the best sub-intervals.
• Shortcuts to alignment
– How to avoid Θ(nm) runtimes? That won’t work with megabase-length sequences!
– Most common trick: seeds. Short exact or near-exact matches between two sequences are required to find an alignment.