a faster reliable algorithm to estimate the p-value of the multinomial llr statistic

A faster reliable algorithm to estimate

the p-value of the multinomial llr statistic

Uri Keich and Niranjan Nagarajan(Department of Computer Science,Cornell University, Ithaca, NY, USA)

http://www.cornell.edu/

Motivation: Hunting for Motifs

Is there a motif here?

Is the alignment interesting/significant? Which of the columns form a motif?

Or for that matter here?

Work with column profiles Null hypothesis: multinomial distribution

Uncovering significant columns

Calculate p-value of the observed score s

Test Statistic: Log-likelihood ratio

where for a random sample of size

and more … Bioinformatics applications:

Sequence-Profile and Profile-Profile alignment

Locating binding site correlations in DNA Detecting compensatory mutations in

proteins and RNA Other applications:

Signal Processing Natural Language Processing

Direct enumeration

Computational Extremes!

Constant time approximation Fails with N fixed and s approaching the tail

Asymptotic approximation As,

possible outcomes Exact results Exponential time and space requirements

(O(NK)) even with pruning of the search space!

Compute pQ the p.m.f. of the integer valued (Baglivo et al, 1992)

where is the mesh size.

Runtime is polynomial in N, K and Q Accurate upto the granularity of the lattice

The middle path

Baglivo et al.’s approach Compute pQ by first computing the DFT!

Let and then,

Then recover pQ by the inverse-DFT

But how do we compute the DFT in the first place without knowing pQ??

We compute

where

Hertz and Stormo’s Algorithm

Sample from which are independent Poissons with mean (instead of X)

Fact:

Recursion:

Baglivo et al.’s Algorithm Recurse over k to compute DFT of

Qk,n Let DQk,n be . Then,

And is upto a constant

Comparison of the two algorithms

Both have O(QKN2) time complexity Space complexity:

O(QN) for Hertz and Stormo’s method O(Q+N) for Baglivo et al.’s method

Numerical errors: Bounded for Hertz and Stormo’s

method. Baglivo et al.’s method can yield

negative p-values in some cases!

Whats going wrong? Hoeffding (1965) proved

that P(I s) f(N,K)e-s

If we compare with

we would get …

And why? Fixed precision arithmetic:

In double precision arithmetic Due to roundoff errors the smaller

numbers in the summation in the DFT are overwhelmed by the largest ones!

Solution: boost the smaller values How? And by how much?

Our Solution We shift pQ(s) with es:

where is the MGF of IQ To compute the DFT of p we replace

the Poisson pk with

and compute

Which to use? With = 1, N = 400, K = 5,

Optimizing Theoretical error bound

Want to calculate P(I > s0) and so a good choice for seems to be

We can solve for numerically. This only adds O(KN2) to the runtime.

So far … We have shown how to

compensate for the errors in Baglivo et al.’s algorithm

provide error guarantees As a bonus we also avoid the need for log-

computation! The updated algorithm has O(Q+N) space

complexity. However the runtime is still O(QKN2) Can we speed it up?

Unfortunately the FFT based convolution introduces more numerical errors: When =1,

A faster variant! Note that the recursive step is a

convolution

Naïve convolution takes time O(N2) An FFT based convolution,

takes O(NlogN) time!

The final piece Need a shift that works well on both

and The variation over l is expected to be small.

So we focus on computing the correct shift for different values of k.

We make an intuitive choice

where Mk(2) is the MGF of

Experimental Results (Accuracy)

Both the shifts work well in practice! We tested our method with K values

upto a 100, N upto 10,000 and various choices of and s

In all cases relative error in comparison to Hertz and Stormo was less than 1e-9

Experimental Results (Runtime)

As expected, our algorithm scales as NlogN as opposed to N2 for Hertz and Stormo’s method!

Recovering the entire range

Imaginary part of computed p is a measure of numerical error

Adhoc test for reliability: Real(p(j)) > 103 × maxj Imag(p(j))

In practise, we can recover the entire p.m.f. using as few as 2 to 3 different s (or equivalently ) values.

For large Ns this is still significantly faster than Hertz and Stormo’s algorithm.

Work in progress … Rigorous error bounds for choice of

2’s Applying the methodology to

compute other statistics Extending the method to

automatically recover the entire range

Exploring applications

a faster reliable algorithm to estimate the p-value of the multinomial llr statistic

Documents

dft of p

s algorithmrecurse

s methodnumerical errors

s approachcompute pq

latticebaglivo et

pi s fn

stormos methodoq n

onlogn time