week 3b - david r. cheriton school of computer sciencebrowndg/482s14/notes/week3b.pdf ·...

1 CS 482/682, Spring 2014, Week 3B

Week 3B

Topics for this lecture:
•  Local alignment algorithms
•  Heuristic methods to make alignment better

Two big ideas for this lecture:
•  Local alignments must be computed heuristically to avoid hideous runtimes.
•  There is some pretty great computer science hiding under the hood of alignment algorithms.


New topic: local alignment

Typically, when people do alignment, they’re actually finding good local alignments.

Given: two sequences S and T
Find: subregions of S and T for which there’s enough sequence similarity that they’re likely to have come from the homologous model.

(That is, find subregions with greatest score.)

This is done by exactly the same sort of procedure as before, with only one change.


Local alignment recursion

Think about it by dynamic programming:

The best local alignment that ends at $s_i$ and $t_j$ is either
•  The empty alignment
•  The best alignment ending at $s_{i-1}$ and $t_j$, followed by $s_i$ aligned to a gap
•  The best alignment ending at $t_{j-1}$ and $s_i$, followed by $t_j$ aligned to a gap
•  The best alignment ending at $s_{i-1}$ and $t_{j-1}$, followed by $s_i$ aligned to $t_j$.

The score of the empty alignment is zero (because it’s the logarithm of 1 over 1). [Or because we don’t want to deal with zero divided by zero…]


Local alignment, cont’d

With linear gap penalties, then,

$$M(i,j) = \max\{\,0,\; M(i-1,j-1) + s(s_i,t_j),\; M(i-1,j) + g,\; M(i,j-1) + g\,\}$$

And we compute the matrix of “score of the best local alignment ending at $s_i$ and $t_j$” this way.

We want the best possible local alignment.

That’s the one with the highest score in the entire matrix.

When extended to affine gaps, this is the classic Smith-Waterman algorithm.


A quick 1-slide catch-up

We want to find local alignments.

We build the matrix that consists of the score of the best local alignment (which might be the empty one) that ends with $s_i$ and $t_j$.

That works by traditional dynamic programming.

To incorporate more complicated gap penalties, we may need extra matrices, as for global alignment.

After O(nm) time, we have the matrix. Find the highest entry, and backtrack until the score is zero.

That’s the optimal local alignment.
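The fill-then-backtrack procedure above can be sketched in Python as follows. This is a minimal illustration with linear gap penalties only. The function name `local_align` and the scores (match +1, mismatch -1, gap -2) are assumptions made for the example, not values from the lecture.

```python
def local_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one best local alignment.

    M[i][j] holds the score of the best local alignment ending at
    s[i-1] and t[j-1]; the scoring values are illustrative assumptions.
    """
    n, m = len(s), len(t)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0

    # Fill the matrix in O(nm) time, remembering the highest entry.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(0,
                          M[i - 1][j - 1] + sub,  # s_i aligned to t_j
                          M[i - 1][j] + gap,      # s_i aligned to a gap
                          M[i][j - 1] + gap)      # t_j aligned to a gap
            if M[i][j] > best:
                best, bi, bj = M[i][j], i, j

    # Backtrack from the highest entry until the score is zero.
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and M[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + sub:
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + gap:
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

For example, `local_align("TTACGTAA", "GGACGTCC")` picks out the shared `ACGT` region and ignores the unrelated flanks.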


An important side note

It’s very straightforward to not just find one local alignment of S and T, but many of them.

This is crucially important when several subregions of S and T are evolutionarily conserved.

One key example, which we will come to later in the term, is gene finding: the coding parts of genes are conserved, and the other parts, not so much.

But for now, think about that O(nm) runtime.


Is this runtime good enough?

No!

GenBank is $10^{11}$ letters long. (Well, it was in 2007…)

To fill in the DP matrix takes $10^{22}$ time. That’s no good.
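For a rough sense of scale, here is the arithmetic behind that figure. The rate of $10^{9}$ matrix cells per second is an assumption chosen for illustration, not a measurement:

$$10^{11} \times 10^{11} = 10^{22}\ \text{cells}, \qquad \frac{10^{22}\ \text{cells}}{10^{9}\ \text{cells/s}} = 10^{13}\ \text{s} \approx 3 \times 10^{5}\ \text{years}.$$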

We must have a shortcut. Can we change the question?

Local alignments must be computed heuristically to avoid hideous runtimes

What does that mean?


Heuristic algorithms

Heuristic algorithms are basically “algorithms for the real world”.

They acknowledge that the real world is complicated, and you can’t always get what you want.

A heuristic algorithm promises to probably run fast, and to probably solve the problem pretty close to right.

This is much vaguer than a theorem can guarantee.


Heuristics in sequence alignment

Pairwise sequence alignment as we defined it may be kind of stupid:

Why do we build an entire DP table if much of it is a total waste?

Let’s consider a somewhat different problem:

Given: Two sequences S and T.
Find: Some good local alignments of S to T.

We only want to find good alignments. We also might not find them all.


Seeding local alignments

The most important approach, popularized in BLAST:

Most often, a good local alignment will include a region of some length that is perfectly conserved.

(Example: in the best alignment between arginase from staph and arginase from human, there is an 8-amino acid sequence common to both, LVLGGDHS.)

So why not start by assuming we’ll find these “seeds” or “hits”, and then build the local alignment from there?


Easiest to explain for DNA

This is much easier to explain for DNA than for protein, so we’ll start with that.

But the idea spreads to protein as well.

Most of the interesting research in this area in the early 2000s was done here, at UW.

Yes, really.


Seeded alignment

1)  Start with all matches that are k letters long (assume only going forward)

2)  For each of those matches, build up left and down right until we reach a place where the DP matrix is 0.

3)  Keep regions that have high score.

(We’ll do an example on the board.)

Why do this?
1)  Runtime goes down substantially
2)  We probably don’t miss an optimal alignment.


BLASTN

BLASTN matches nucleotide sequences to nucleotide sequences.

It is based on 11-base-long seeds, which must match exactly.

(Vocabulary word: k-mer = a k-base long sequence of DNA. From “polymer.”)

What if we change the seed size, by reducing it by 1?

A 10-base-long sequence occurs once per 1 Mb, if DNA is noise (hah!).

Every time we drop the seed size by 1, we find 4 times as many seeds.


More detail

We start at places where S and T exactly match for k letters.

The expected number of those matches: roughly $4^{-k}nm$. Why? Well, there are $(n-k+1)(m-k+1)$ places, which is very close to $nm$. And each has probability $4^{-k}$ of being a match.

So the expected number of matches is close to $nm4^{-k}$.

(You’ll see this in more detail on your homework.)

If n = 100,000, m = 100,000, k = 11, expect around 2400 seeds to be found.
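A quick way to check this estimate, and the "4 times as many seeds" rule, is to compute it directly. This is a small sketch under the slide's assumption that both sequences are uniform random DNA. The function name `expected_seed_count` is made up for this example.

```python
def expected_seed_count(n, m, k, alphabet_size=4):
    """Expected number of exact k-letter matches between two random sequences.

    There are (n-k+1)(m-k+1) pairs of start positions, and each pair
    matches with probability alphabet_size ** -k.
    """
    positions = (n - k + 1) * (m - k + 1)
    return positions / alphabet_size ** k


seeds_11 = expected_seed_count(100_000, 100_000, 11)
seeds_10 = expected_seed_count(100_000, 100_000, 10)
print(round(seeds_11))      # expected count of 11-mer seeds
print(seeds_10 / seeds_11)  # close to 4: one letter shorter, 4x the seeds
```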


What do we do with seeds?

Try to build an alignment.

Remember: this is a heuristic algorithm. We don’t need to find everything.

Fast algorithms, like BLAST, quickly determine if a seed is a good seed or a bad seed.

•  Quick search in both directions; if most symbols match, it’s a good seed. If most don’t match, it’s bad.

•  Build a local alignment around seeds that are chosen.
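One simple way to picture the "quick search in both directions" is the check below. It is only a sketch of the idea, not BLAST's actual filter. The function name `is_good_seed`, the flank length of 20, and the 60% identity cutoff are all assumptions chosen for illustration.

```python
def is_good_seed(s, t, i, j, k, flank=20, min_identity=0.6):
    """Decide whether the seed s[i:i+k] == t[j:j+k] looks worth extending.

    Compares up to `flank` letters on each side of the seed, without gaps,
    and keeps the seed if most of them match. All thresholds here are
    hypothetical.
    """
    matches = total = 0

    # Look left of the seed, one letter at a time.
    for d in range(1, flank + 1):
        if i - d < 0 or j - d < 0:
            break
        total += 1
        matches += s[i - d] == t[j - d]

    # Look right of the seed.
    for d in range(flank):
        a, b = i + k + d, j + k + d
        if a >= len(s) or b >= len(t):
            break
        total += 1
        matches += s[a] == t[b]

    return total > 0 and matches / total >= min_identity
```

Seeds that pass a cheap test like this are the ones handed to the full local alignment step.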


How do we find the seeds?

We want:
•  Places where S matches T for k letters.

How to do that?
•  Simplest approach: hash table of all k-letter substrings of S.
•  Look up each k-letter substring of T.
•  Matches form seeds.

How do we build the index?

For DNA, we can build a trie (you saw those in 240) of the sequence’s k-mers.
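The hash-table approach can be sketched in a few lines of Python. This version uses a dictionary rather than a trie. The function names `build_kmer_index` and `find_seeds` are made up for this example.

```python
from collections import defaultdict


def build_kmer_index(s, k):
    """Map every k-letter substring of s to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i)
    return index


def find_seeds(index, t, k):
    """Return all (i, j) with s[i:i+k] == t[j:j+k], scanning t once."""
    seeds = []
    for j in range(len(t) - k + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for i in index.get(t[j:j + k], ()):
            seeds.append((i, j))
    return seeds


index = build_kmer_index("ACGTACGT", 4)
print(find_seeds(index, "TTACGTT", 4))
```

Because the index depends only on S, it can be built once and reused for any number of query sequences T.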


Overall runtime

Build the index: O(n) time.

Find matches between the index and T: O(m) time to scan T, plus we need to record all of the r hits found: O(m+r).

Extend the matches to find true and false hits
•  Probably a constant amount of work, on average, for each hit
   –  Most hits are random chance

Overall runtime: O(n+m+r). That’s not bad, if r is small!


BLASTP

With nucleotides, we’re requiring k positions with exact matches.

For proteins, that’s not really reasonable: some amino acids mutate to another one very often.

So BLASTP looks for 3- or 4-letter protein sequences that are “very close” to each other, and then builds matches from them.

(Where “very close” means that the total BLOSUM score in the short window is at least +13.)


How to implement that?

With BLASTP:
•  Build an automaton that reflects all strings close to short substrings of T (the short sequence)
•  Scan S (the longer sequence), looking for matches
•  Extend matches to alignments. The extension phase is more complex, too.

This is harder, but actually a lot like CS 360, for those of you who took that.


We won’t do the automata

But here’s an approximation of how to do it:

1)  For every possible protein 3-mer from T, find all 3-mers that, if found in S, score at least +13 (or whatever). Build these into a data structure C.

2)  Build a hash table H for S of its 3-mers, just like for the nucleotide case

3)  For every 3-mer x in T, its matches are all of the entries in the hash table H for the keys found at C(x).

C is a pretty small structure: there are only 8000 3-mers.
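The three steps above can be sketched as follows. To keep the example short, it does NOT use a real BLOSUM matrix. The function `toy_score` (+5 for identity, +3 within a few hand-picked similar groups, -1 otherwise) is a stand-in invented for illustration, as are the names `neighborhood` and `protein_seeds`.

```python
from collections import defaultdict
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical similarity groups; a real implementation would use BLOSUM scores.
SIMILAR_GROUPS = [set("ILV"), set("DE"), set("KR")]


def toy_score(a, b):
    """Toy substitution score standing in for a BLOSUM entry."""
    if a == b:
        return 5
    if any(a in g and b in g for g in SIMILAR_GROUPS):
        return 3
    return -1


def neighborhood(word, threshold=13):
    """All words of the same length scoring at least `threshold` against `word`."""
    return [
        "".join(cand)
        for cand in product(AMINO_ACIDS, repeat=len(word))
        if sum(toy_score(a, b) for a, b in zip(word, cand)) >= threshold
    ]


def protein_seeds(s, t, k=3, threshold=13):
    """Return (i, j) pairs where s[i:i+k] is 'very close' to t[j:j+k]."""
    # Step 2: hash table H of the k-mers of S.
    H = defaultdict(list)
    for i in range(len(s) - k + 1):
        H[s[i:i + k]].append(i)

    # Steps 1 and 3: C maps each k-mer of T to its close k-mers,
    # which are then looked up in H.
    C = {}
    seeds = []
    for j in range(len(t) - k + 1):
        x = t[j:j + k]
        if x not in C:
            C[x] = neighborhood(x, threshold)
        for y in C[x]:
            for i in H.get(y, ()):
                seeds.append((i, j))
    return seeds
```

With these toy scores, `neighborhood("LVK")` contains `LVK` itself plus the five words that differ from it by one similar substitution, so `protein_seeds("MAIVKG", "LVK")` finds the near-match `IVK`.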


Which sequence to index?

That’s actually a tough question.

Here’s a typical scenario:
•  S is the human genome (length $n$)
•  $T_1$ is a short protein sequence (length $m_1$)
•  $T_2$ is another short sequence (length $m_2$)

If we’re smart, build an index for S, once, and then look up the short sequences in it.

Added time for $T_2$ is more like $O(m_2)$, not $O(n+m_2)$.


More on indexing

But memory is a concern:
•  Indexing the human genome is expensive!
•  Oh, wait. No, it isn’t, not anymore… you probably should index the longer sequence.

BLASTN (1990) indexes the query, not the database.

BLAT (2000) indexes the database, not the query.

BLASTP also indexes the query.
•  Automaton is bigger than sequence
•  Protein databases might be pretty big.


Extensions to this idea

Two-hit BLAST:
•  Require two seeds (probably shorter) that are nearer than k from each other, and base the alignment on their enclosing box.
•  Potentially even fewer false positives, but one has to use shorter seeds. There’s quite a tradeoff here; see your next homework.
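Here is one way the two-hit idea could look in code. It is a sketch under assumptions the slide does not spell out: the two seeds must not overlap, the second must start less than `window` letters after the first in both sequences, and the names `two_hit_boxes`, `seed_len` and `window` are invented for this example.

```python
def two_hit_boxes(seeds, seed_len, window):
    """Pair up nearby seeds and return the box enclosing each pair.

    `seeds` is a list of (i, j) start positions in S and T. A box is
    (i_start, j_start, i_end, j_end), with the ends exclusive.
    """
    seeds = sorted(seeds)
    boxes = []
    for a, (i1, j1) in enumerate(seeds):
        for i2, j2 in seeds[a + 1:]:
            if i2 - i1 >= window:
                break  # sorted by i, so every later seed is too far away
            # Second seed must start after the first ends, in both sequences.
            if i2 >= i1 + seed_len and j1 + seed_len <= j2 < j1 + window:
                boxes.append((i1, j1, i2 + seed_len, j2 + seed_len))
    return boxes
```

An isolated seed produces no box at all, which is where the reduction in false positives comes from.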


Wrap-up

•  Local alignment
   –  Not a full alignment of all letters of S with all letters of T.
   –  Just the best sub-intervals.
•  Shortcuts to alignment
   –  How to avoid Θ(nm) runtimes? That won’t work with megabase-length sequences!
   –  Most common trick: seeds. Short exact or near-exact matches between two sequences are required to find an alignment.
