
1

A simple, practical and complete O(n^3/log(n))-time Algorithm

for RNA folding using the Four-Russians Speedup

Yelena Frid and Dan Gusfield
Department of Computer Science, UC Davis

2

Background

- The problem of computationally predicting the secondary structure of RNA molecules was first introduced more than thirty years ago by:

Nussinov et al. (1978, 1980)
Waterman and Smith (1978)
Zuker and Stiegler (1981)

They presented dynamic programming solutions with an asymptotic runtime of O(n^3), where n is the length of the RNA sequence.

3

There have been several improvements to the running time for this problem:

• A complex worst-case speedup to O(n^3 * (log log n)/(log n)^(1/2)) (Akutsu 1999)

• Practical heuristic speedups of O(nP), where P is in [n, n^2] (Wexler 2007; Backofen 2009). This is still O(n^3) in terms of n, but it is important to note that P is empirically much less than n^2.

Our approach has an O(n^3/log(n)) running time using the Four-Russians speedup.

4

Four Russians is well known

- The Four-Russians method is a general technique for speeding up dynamic programs.

- The method involves extensive preprocessing of a wide-range of possible inputs, before any actual input is seen.

However, it has not been applied to RNA folding, despite an ‘expectation’ that it could be.

5

Possible reasons for difficulty in applying the Four-Russians speedup

I. The widely exposed version of the original dynamic-programming algorithm does not lend itself to application of the Four-Russians technique. (Solution: choose a different order of evaluation.)

II. It doesn’t seem possible to separate the preprocessing and the computation. (Solution: interleave the two.)

We made use of the insights presented in the Graham et al. (1980) paper, which gives a Four-Russians solution to the problem of context-free language recognition, where similar problems were encountered.

6

Basic RNA-folding problem

- Find a maximum number of non-crossing matches of complementary nucleotides in an RNA sequence of length n.

Enhancements:

- Richer scoring schemes
- Minimum-energy calculations via DP

7

Input to the basic RNA-folding problem consists of a string K of length n over a four-letter alphabet {A,U,C,G}

A matching consists of non-crossing disjoint pairs of sites in K.

[Figure: example sites i, i', j, j' in K, illustrating which pairs of matches are non-crossing.]

B(i,j)

B(i,j) is the score given to the match between nucleotides in sites i and j of K.

In a simple scoring scheme, a score of 1 is given for complementary pairs (A-U or C-G) and 0 otherwise.
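As an illustrative aside (not from the original slides), here is a minimal Python sketch of this simple scoring scheme; B(i,j) would then be pair_score(K[i], K[j]):

def pair_score(a, b):
    # 1 for the complementary pairs A-U and C-G, 0 otherwise
    return 1 if {a, b} in ({"A", "U"}, {"C", "G"}) else 0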

9

Introduction to the original O(n^3) DP

Let S(i,j) represent the score for the optimal folding of the subsequence consisting of the sites in K from i to j > i, inclusive.

The DP recurrence relations are based on the different possibilities of pairing nucleotides.

10

S(i,j) will equal the maximum of:

S(i+1, j-1) + B(i,j)     (sites i and j pair with each other)
S(i, j-1)                (site j is left unpaired)

11

[Figure: the head and tail of the RNA strand, illustrating this case.]

S(i+1, j)                (site i is left unpaired)

12

As a result, the recurrence for the optimal fold score S(i,j) is the maximum of:

S(i+1, j-1) + B(i,j)                           (Rule a: i pairs with j)
S(i, j-1)                                      (Rule b: j unpaired)
S(i+1, j)                                      (Rule c: i unpaired)
max over i < k < j of S(i, k-1) + S(k, j)      (Rule d: bipartition at k)
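As a tiny worked example (not from the original slides, using the simple scoring scheme): for K = G C A U (sites 1 through 4), B(1,2) = 1 (G-C), B(3,4) = 1 (A-U), and every other B value is 0. The recurrence gives S(1,2) = 1, S(3,4) = 1, S(1,3) = S(2,4) = 1, and finally S(1,4) = 2, where the 2 is supplied by Rule d with k = 3: S(1,2) + S(3,4) = 1 + 1 = 2.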

13

[Figure: the upper-triangular DP matrix of entries S[i,j], j > i, for the example sequence A U C G G C A U C A A, with rows indexed by i and columns indexed by j.]

The matrix is usually evaluated in order of increasing length of the subsequence, i.e. increasing distance between i and j.

14

[Figure: the same DP matrix, highlighting the alternative evaluation order.]

- We choose an alternative permissible order.
- The algorithm will evaluate in order of increasing j and, within each column, decreasing i.

15

for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1) + B(i,j),     Rule a
                      S(i,j-1) )               Rule b
    for i = j-1 down to 1
        S(i,j) = max( S(i,j), S(i+1,j) )       Rule c
        for k = j-1 down to i+1                {Rule d loop}
            S(i,j) = max( S(i,j), S(i,k-1) + S(k,j) )    Rule d
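To make the evaluation order concrete, here is a short Python sketch of this O(n^3) dynamic program (illustrative code added here, not the authors' implementation; it uses 0-indexing rather than the slide's 1-indexing, and the simple 0/1 scoring scheme):

def basic_fold(K):
    def B(a, b):
        # 1 for the complementary pairs A-U and C-G, 0 otherwise
        return 1 if {a, b} in ({"A", "U"}, {"C", "G"}) else 0

    n = len(K)
    S = [[0] * n for _ in range(n)]               # S[i][j] = 0 whenever j <= i
    for j in range(1, n):
        for i in range(j):                        # Rules a and b
            S[i][j] = max(S[i + 1][j - 1] + B(K[i], K[j]), S[i][j - 1])
        for i in range(j - 1, -1, -1):            # decreasing i
            S[i][j] = max(S[i][j], S[i + 1][j])                   # Rule c
            for k in range(j - 1, i, -1):                         # Rule d loop
                S[i][j] = max(S[i][j], S[i][k - 1] + S[k][j])     # Rule d
    return S

For example, basic_fold("GGGAAACCC")[0][8] should return 3 (three nested G-C pairs).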

16

for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1) + B(i,j),     Rule a
                      S(i,j-1) )               Rule b
    for i = j-1 down to 1
        S(i,j) = max( S(i,j), S(i+1,j) )       Rule c
        for k = j-1 down to i+1                {Rule d loop}
            S(i,j) = max( S(i,j), S(i,k-1) + S(k,j) )    Rule d

Rules a, b and c take O(n^2) time in total. The Rule d loop takes O(n) time for each of the O(n^2) pairs (i,j), giving O(n^3) overall: it is the bottleneck.

17

Four-Russians speedup: first hint

for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1) + B(i,j),     Rule a
                      S(i,j-1) )               Rule b
    for i = j-1 down to 1
        S(i,j) = max( S(i,j), S(i+1,j) )       Rule c
        for k = j-1 down to i+1                {Rule d loop}
            S(i,j) = max( S(i,j), S(i,k-1) + S(k,j) )    Rule d

Idea: change the decrement-by-one Rule d loop into a constant number of operations for each group of size q.

For Rules a and b (where i and j pair with each other, or j is not involved in a pair and we take the score for the optimal matching of the nucleotides from i to j-1), there is no dependency on an optimal fold that involves the jth nucleotide within column j. Rules c and d do involve the jth nucleotide, but because we move along i in decreasing order, the column-j values they need have already been computed.

What will Rule d loop look like then?

19

Computation components for speeding up Rule d loop

There are several components for speeding up the Rule d loop:
• Rgroup
• Vg vector
• Little vg vector
• k*
• R Table

At first these may seem unclear and extraneous, but they will lead to a speed-up.

20

Rgroup g

Conceptually divide each column in the S matrix into groups of q rows. Each is called an Rgroup, indexed by g.

21

Rgroup example where q = 3

[Figure: the DP matrix for A U C G G C A U C A A with the rows of each column conceptually divided into Rgroups of q = 3 rows: Rgroup 0, Rgroup 1, Rgroup 2, ...]

22

Computation components for speeding up Rule d loop

• Rgroup

• Vg vector

• Little vg vector

• k*

• R Table

23

Vg vector

For a fixed j, consider an Rgroup g consisting of rows z, z-1, ..., z-q+1 for some z < j.

Let Vg = ( S(z,j), S(z-1,j), ..., S(z-q+1,j) ) for each Rgroup g.

[Figure: column j of the matrix with its entries grouped into vectors V0, V1, ...]
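A small illustrative helper (not from the original slides; it assumes the 0-indexed convention that Rgroup g covers rows g*q through g*q+q-1, so that z = g*q+q-1 is its bottom row):

def vg_vector(S, g, q, j):
    # Vg = ( S(z,j), S(z-1,j), ..., S(z-q+1,j) ), bottom row z first
    z = g * q + q - 1
    return [S[z - l][j] for l in range(q)]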

24

Computation components for speeding up Rule d loop

• Rgroup

• Vg vector

• Little vg vector

• k*

25

Little vg vector

Observe that for the simple scoring scheme:

B(i,j) = 1 when (i,j) is a permitted match
B(i,j) = 0 when (i,j) is not a permitted match

S(z-1,j) - S(z,j) is either 1 or 0, since adding site z-1 to the subsequence can contribute at most one new match and never decreases the optimum.

[Figure: example differences between consecutive entries of V0 and V1, e.g. 5 - 4 = 1 and 5 - 5 = 0.]

Hence, the difference between consecutive values in any Vg vector is either 0 or 1.

26

Little vg vector

The changes in consecutive values of Vg could therefore be encoded into little vg, a vector of length q-1.

[Figure: encoding example: the differences of V0 (5 - 4 = 1, 5 - 5 = 0) form the little vector v0; the first entry of V0 is the base of V0.]

27

Little vector vg

[Figure: encoding example for V1: the differences 3 - 3 = 0 and 4 - 3 = 1 form the little vector v1.]

- Encoding each Vg into little vg takes O(q) time.
- Each column has j/q different Rgroups.
- It therefore takes O(n) time to do the encoding for an entire column.

Decoding the offset (non-formal definition)

decode: vg -> V', where
- V' is a vector of length q-1
- if V'[l] = 4, then Vg[l] = x + 4, where x is the value of Vg(0), the base of the Vg vector.

29

Decode: formal definition and example

V' is the running sum of the values in the little vector vg; V'[l] gives the offset of Vg[l] from the base value Vg(0).

decode: vg -> V'

[Figure: worked example with q = 7. The consecutive differences of Vg (7-7=0, 8-7=1, 6-5=1, 7-6=1, 5-4=1, 5-5=0) form vg; their running sums form V', the offsets from the base.]
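A minimal Python sketch of encode and decode under the simple 0/1 scoring scheme (illustrative code, not the authors'; vg is returned as a tuple so it can later serve as a table key):

def encode(Vg):
    # little vg: the q-1 consecutive differences, each 0 or 1
    return tuple(Vg[l + 1] - Vg[l] for l in range(len(Vg) - 1))

def decode(vg):
    # V': the running sums of vg, i.e. the offsets of Vg[1..q-1] from the base Vg[0]
    offsets, total = [], 0
    for d in vg:
        total += d
        offsets.append(total)
    return offsets

So Vg[l] = Vg[0] + decode(vg)[l-1] for l = 1, ..., q-1; encoding and then decoding loses only the base value.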

30

Computation components for speeding up Rule d loop

• Rgroup

• Vg vector

• Little vg vector

• k*

• Table R

31

Defining k* for an Rgroup g

In column j, let k*(i,g,j) be the index k in Rgroup g such that the sum S(i,k-1) + S(k,j) is maximized over the indices k in Rgroup g.

32

Example

Assume column j = 11 and row i = 3. To directly compute k*(i,g,j) for Rgroup 1 (g = 1) we would have to find the maximum of

S(3,4) + S(5,11)
S(3,5) + S(6,11)
S(3,6) + S(7,11)

and store the k that corresponds to that maximum bipartition. This would require O(q) operations.
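For comparison, the direct O(q) computation might look like this in Python (an illustrative sketch, not from the slides; rows is assumed to hold the row indices k of Rgroup g with i < k < j):

def k_star_direct(S, i, j, rows):
    # argmax over k in the Rgroup of the bipartition score S(i,k-1) + S(k,j)
    return max(rows, key=lambda k: S[i][k - 1] + S[k][j])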

33

Computation components for speeding up Rule d loop

• Rgroup

• Vg vector

• Little vg vector

• k*

• Table R

Table R

Introducing Table R (currently a black box) that, given (i, g, vg), returns k*(i,g,j) in O(1) operations.

35

Modified Rule d loop using Table R

for j = 2 to n
    ....
    for i = j-1 down to 1
        for g = (j-1)/q down to (i+1)/q        {Rule d loop}
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1) + S(k*,j) )

36

Run-time of the modified Rule d loop using Table R

for j = 2 to n
    ....
    for i = j-1 down to 1
        for g = (j-1)/q down to (i+1)/q        {Rule d loop: where Rgroup g is fully complete}
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1) + S(k*,j) )

The asymptotic run-time of the Rule d loop is now O(n/q) per (i,j) pair (one O(1) lookup for each of the O(n/q) Rgroups), i.e. O(n^2/q) per column j and O(n^3/q) overall.

37

Modified Rule d loop using Table R

for j = 2 to n
    ....
    for i = j-1 down to 1
        for g = (j-1)/q down to (i+1)/q        {where Rgroup g is fully complete}
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1) + S(k*,j) )

Having k* speeds up the Rule d loop. How can we know k* when we need it?
Answer: Four-Russians preprocessing.

38

Preprocessing

39

Preprocessing Table R: Cgroups

We conceptually divide the columns into groups of size q, called Cgroups. [Figure: example with q = 3.]

40

Preprocessing Table R

Assume that we run the 'Second Dynamic Programming Algorithm' until j = q, so that all the S(i,j) values in Cgroup 0 have been computed. At this point we can compute the following:

for each binary vector v of length q-1
    V' = decode(v)
    for each i such that i < q-1
        R(i, 0, v) = the index k in Rgroup 0 such that S(i,k-1) + V'[k] is maximized

41

Preprocessing in general

In general, for Cgroup g > 0, we could do a similar preprocessing after all the entries in the columns of Cgroup g have been computed. That is, R(i,g,v) could be found and stored for all i < g*q.

once Cgroup g = (j/q) is complete
    for each binary vector v of length q-1           O(q * n * 2^(q-1)) per Cgroup
        V' = decode(v)
        for each i such that i < g*q
            R(i, g, v) = the index k in Rgroup g such that S(i,k-1) + V'[k] is maximized

The total for pre-processing over all O(n/q) Cgroups is O(q * n * 2^(q-1)) * O(n/q) = O(n^2 * 2^(q-1)) time.
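A Python sketch of this preprocessing step for a single Rgroup g (illustrative code, not the authors' implementation, under the same 0-indexed conventions as the earlier sketches: Rgroup g covers rows g*q through g*q+q-1, listed bottom row first to match Vg, and R is a dictionary keyed by (i, g, v)):

from itertools import product

def preprocess_rgroup(S, R, g, q):
    rows = list(range(g * q + q - 1, g * q - 1, -1))   # bottom row of the Rgroup first
    for v in product((0, 1), repeat=q - 1):            # every possible little vector
        offsets = [0]                                   # decode(v), with the base offset 0 prepended
        for d in v:
            offsets.append(offsets[-1] + d)
        for i in range(g * q):                          # every row i above the Rgroup
            best = max(range(q), key=lambda l: S[i][rows[l] - 1] + offsets[l])
            R[(i, g, v)] = rows[best]

The argmax is unaffected by adding the same base value to every offset, which is why R can be indexed by the little vector alone rather than by the actual Vg.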

42

RNA folding algorithm with Four-Russians Speedup

43

RNA folding algorithm with Four-Russians Speedup

for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1) + B(i,j), S(i,j-1) )
    for i = j-1 down to 1
        for g = (j-1)/q down to (i+1)/q        {Rule d loop}
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1) + S(k*,j) )
    once Cgroup g = (j/q) is complete
        for each binary vector v of length q-1
            V' = decode(v)
            for each i such that i < g*q
                R(i, g, v) = the index k in Rgroup g such that S(i,k-1) + V'[k] is maximized

Computation (Rule d loop): O(n^3/q).   Preprocessing: O(n^2 * 2^(q-1)).
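Putting the pieces together, here is a self-contained Python sketch of the whole algorithm (illustrative code added here, not the authors' implementation). It is 0-indexed, keeps R as a dictionary, and falls back to the direct Rule d loop for the partial groups near i and near j (the groups that are not fully complete or not yet preprocessed), so its result should match the plain O(n^3) DP for any q >= 2:

from itertools import product

def pair_score(a, b):
    # simple scoring: 1 for the complementary pairs A-U and C-G, 0 otherwise
    return 1 if {a, b} in ({"A", "U"}, {"C", "G"}) else 0

def four_russians_fold(K, q=4):
    n = len(K)
    S = [[0] * n for _ in range(n)]                 # S[i][j] = 0 whenever j <= i
    R = {}                                          # R[(i, g, v)] -> best k in Rgroup g

    def rows_desc(g):                               # rows of Rgroup g, bottom row first
        return list(range(g * q + q - 1, g * q - 1, -1))

    for j in range(1, n):
        for i in range(j):                          # Rules a and b
            S[i][j] = max(S[i + 1][j - 1] + pair_score(K[i], K[j]), S[i][j - 1])
        vg_cache = {}                               # little vectors for column j, by Rgroup
        for i in range(j - 1, -1, -1):
            S[i][j] = max(S[i][j], S[i + 1][j])     # Rule c
            g_lo = (i + q) // q                     # first Rgroup lying entirely below row i
            g_hi = j // q                           # Rgroups g_lo .. g_hi-1 are complete and preprocessed
            for g in range(g_lo, g_hi):             # table-driven Rule d: one O(1) lookup per Rgroup
                if g not in vg_cache:
                    V = [S[k][j] for k in rows_desc(g)]
                    vg_cache[g] = tuple(V[l + 1] - V[l] for l in range(q - 1))
                k_star = R[(i, g, vg_cache[g])]
                S[i][j] = max(S[i][j], S[i][k_star - 1] + S[k_star][j])
            if g_lo < g_hi:                         # leftover k before and after the full Rgroups
                leftovers = list(range(i + 1, g_lo * q)) + list(range(g_hi * q, j))
            else:
                leftovers = range(i + 1, j)
            for k in leftovers:                     # direct Rule d on the leftovers
                S[i][j] = max(S[i][j], S[i][k - 1] + S[k][j])
        if (j + 1) % q == 0:                        # Cgroup (j+1)/q - 1 is now complete:
            g = (j + 1) // q - 1                    # Four-Russians preprocessing for Rgroup g
            rows = rows_desc(g)
            for v in product((0, 1), repeat=q - 1):
                offsets = [0]                       # decode(v), base offset included
                for d in v:
                    offsets.append(offsets[-1] + d)
                for i in range(g * q):              # every row i above the Rgroup
                    best = max(range(q), key=lambda l: S[i][rows[l] - 1] + offsets[l])
                    R[(i, g, v)] = rows[best]
    return S

For example, four_russians_fold("GGGAAACCC")[0][8] should return 3, the same value as the basic O(n^3) DP sketched earlier.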

44

Run-Time

Setting q = log(n), the total run-time of

O(n^2 * 2^(q-1)) (pre-processing) + O(n^3/q) (computation) + O(n^2 * q)

simplifies to O(n^3/log(n)) time in total.

45

Empirical Results

- We compare our Four-Russians algorithm to the original O(n^3)-time algorithm.

- The empirical results shown in Table 1 give the average time over 50 tests of randomly generated RNA sequences and 25 sequences downloaded from GenBank, for each size between 1,000 bp and 6,000 bp.

- The algorithm performs identically for randomly generated and GenBank sequences of equal length.

- This is to be expected, because the algorithm's run time does not depend on the sequence characters.

46

[Table 1: average running times, reported in seconds.]

47

Variations on scoring scheme B

The Four-Russians speed-up could be extended to any B(i,j) for which the set of possible differences between S(i,j) and S(i+1,j) does not depend on n.

Let C denote the size of the set of all possible differences. Then the asymptotic time is O(n^2 * C^(q-1)) for pre-processing plus O(n^3/q) for computation, which remains O(n^3/log(n)) when C is a constant and q = Θ(log n).

48

Parallel Computing

When running this algorithm in parallel on n processors, we can reduce the total computation to O(n^2 * log(n)).

(This could be further reduced to O(n^2) time by a re-ordering method, at a space trade-off.)

49

Conclusion

- A practical and easy-to-implement Four-Russians speed-up algorithm for prediction of RNA secondary structure, which can be applied to a wide variety of scoring schemes.

- The asymptotic time of O(n^3/log(n)) is achieved by interleaving computation and preprocessing.

- This time can be further lowered to O(n^2 * log(n)) when using n processors in parallel.
