a faster reliable algorithm to estimate the p-value of the multinomial llr statistic

23
A faster reliable algorithm to estimate the p-value of the multinomial llr statistic Uri Keich and Niranjan Nagarajan (Department of Computer Science, Cornell University, Ithaca, NY, USA)

Upload: eliot

Post on 06-Feb-2016

33 views

Category:

Documents


0 download

DESCRIPTION

A faster reliable algorithm to estimate the p-value of the multinomial llr statistic. Uri Keich and Niranjan Nagarajan (Department of Computer Science, Cornell University, Ithaca, NY, USA). Motivation: Hunting for Motifs. Is there a motif here?. Is the alignment interesting/significant? - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

A faster reliable algorithm to estimate

the p-value of the multinomial llr statistic

Uri Keich and Niranjan Nagarajan(Department of Computer Science,Cornell University, Ithaca, NY, USA)

Page 2: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Motivation: Hunting for Motifs

Page 3: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Is there a motif here?

Is the alignment interesting/significant? Which of the columns form a motif?

Or for that matter here?

Page 4: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Work with column profiles Null hypothesis: multinomial distribution

Uncovering significant columns

Calculate p-value of the observed score s

Test Statistic: Log-likelihood ratio

where for a random sample of size

Page 5: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

and more … Bioinformatics applications:

Sequence-Profile and Profile-Profile alignment

Locating binding site correlations in DNA Detecting compensatory mutations in

proteins and RNA Other applications:

Signal Processing Natural Language Processing

Page 6: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Direct enumeration

Computational Extremes!

Constant time approximation Fails with N fixed and s approaching the tail

Asymptotic approximation As,

possible outcomes Exact results Exponential time and space requirements

(O(NK)) even with pruning of the search space!

Page 7: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Compute pQ the p.m.f. of the integer valued (Baglivo et al, 1992)

where is the mesh size.

Runtime is polynomial in N, K and Q Accurate upto the granularity of the lattice

The middle path

Page 8: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Baglivo et al.’s approach Compute pQ by first computing the DFT!

Let and then,

Then recover pQ by the inverse-DFT

But how do we compute the DFT in the first place without knowing pQ??

Page 9: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

We compute

where

Hertz and Stormo’s Algorithm

Sample from which are independent Poissons with mean (instead of X)

Fact:

Recursion:

Page 10: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Baglivo et al.’s Algorithm Recurse over k to compute DFT of

Qk,n Let DQk,n be . Then,

And is upto a constant

Page 11: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Comparison of the two algorithms

Both have O(QKN2) time complexity Space complexity:

O(QN) for Hertz and Stormo’s method O(Q+N) for Baglivo et al.’s method

Numerical errors: Bounded for Hertz and Stormo’s

method. Baglivo et al.’s method can yield

negative p-values in some cases!

Page 12: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Whats going wrong? Hoeffding (1965) proved

that P(I s) f(N,K)e-s

If we compare with

we would get …

Page 13: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

And why? Fixed precision arithmetic:

In double precision arithmetic Due to roundoff errors the smaller

numbers in the summation in the DFT are overwhelmed by the largest ones!

Solution: boost the smaller values How? And by how much?

Page 14: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Our Solution We shift pQ(s) with es:

where is the MGF of IQ To compute the DFT of p we replace

the Poisson pk with

and compute

Page 15: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Which to use? With = 1, N = 400, K = 5,

Page 16: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Optimizing Theoretical error bound

Want to calculate P(I > s0) and so a good choice for seems to be

We can solve for numerically. This only adds O(KN2) to the runtime.

Page 17: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

So far … We have shown how to

compensate for the errors in Baglivo et al.’s algorithm

provide error guarantees As a bonus we also avoid the need for log-

computation! The updated algorithm has O(Q+N) space

complexity. However the runtime is still O(QKN2) Can we speed it up?

Page 18: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Unfortunately the FFT based convolution introduces more numerical errors: When =1,

A faster variant! Note that the recursive step is a

convolution

Naïve convolution takes time O(N2) An FFT based convolution,

takes O(NlogN) time!

Page 19: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

The final piece Need a shift that works well on both

and The variation over l is expected to be small.

So we focus on computing the correct shift for different values of k.

We make an intuitive choice

where Mk(2) is the MGF of

Page 20: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Experimental Results (Accuracy)

Both the shifts work well in practice! We tested our method with K values

upto a 100, N upto 10,000 and various choices of and s

In all cases relative error in comparison to Hertz and Stormo was less than 1e-9

Page 21: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Experimental Results (Runtime)

As expected, our algorithm scales as NlogN as opposed to N2 for Hertz and Stormo’s method!

Page 22: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Recovering the entire range

Imaginary part of computed p is a measure of numerical error

Adhoc test for reliability: Real(p(j)) > 103 × maxj Imag(p(j))

In practise, we can recover the entire p.m.f. using as few as 2 to 3 different s (or equivalently ) values.

For large Ns this is still significantly faster than Hertz and Stormo’s algorithm.

Page 23: A faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Work in progress … Rigorous error bounds for choice of

2’s Applying the methodology to

compute other statistics Extending the method to

automatically recover the entire range

Exploring applications