yelena frid and dan gusfield department of computer science uc davis

1

A simple, practical and complete O(n^3/log(n))-time Algorithm

for RNA folding using the Four-Russians Speedup

Yelena Frid and Dan Gusfield, Department of Computer Science, UC Davis

2

Background

- The problem of computationally predicting the secondary structure of RNA molecules was first introduced more than thirty years ago by:

Nussinov et al. (1978, 1980)

Waterman and Smith (1978)

Zuker and Stiegler (1981)

They presented dynamic programming solutions with an asymptotic running time of O(n^3), where n is the length of the RNA sequence.

3

There have been several improvements to the running time of this problem:

• A complex worst-case speedup of O(n^3 * (log log n) / (log n)^(1/2)) (Akutsu 1999)

• Practical heuristic speedups of O(nP), where P is in [n, n^2] (Wexler 2007; Backofen 2009). This is still O(n^3) in terms of n. It is important to note that P is empirically much less than n^2.

Our approach has an O(n^3/log(n)) running time using the Four-Russians speedup.

4

Four Russians is well known

- The Four-Russians method is a general technique for speeding up dynamic programs.

- The method involves extensive preprocessing of a wide range of possible inputs, before any actual input is seen.

However, it has not been applied to RNA folding, despite an 'expectation' that it could be.

5

Possible reasons for difficulty in applying the Four-Russians speedup

I. The widely exposed version of the original dynamic-programming algorithm does not lend itself to application of the Four-Russians technique. (Solution: choose a different order of evaluation.)

II. It doesn't seem possible to separate the preprocessing and the computation. (Solution: interleave the two.)

We made use of the insights presented in the paper of Graham et al. (1980), which gives a Four-Russians solution to the problem of context-free language recognition, where similar problems were encountered.

6

Basic RNA-folding problem

- Find a maximum number of non-crossing matches of complementary nucleotides in an RNA sequence of length n.

Enhancements:

- Richer scoring schemes

- Minimum energy calculations (DP)

7

Input to the basic RNA-folding problem consists of a string K of length n over a four-letter alphabet {A,U,C,G}

A matching consists of non-crossing disjoint pairs of sites in K.

[Figure: sites i, i', j, j' marked along K]

B(i,j)

B(i,j) is the score given to the match between nucleotides in sites i and j of K.

In a simple scoring scheme, a score of 1 is given for complementary pairs (A-U or C-G), and 0 otherwise.
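The simple scoring scheme can be sketched in Python as below. This is only an illustration: the names `COMPLEMENTS` and `B` (as a Python function) and the use of 0-based site indices are assumptions, not part of the slides.

```python
# Assumed representation: K is a Python string over {A, U, C, G},
# and sites are 0-based (the slides count sites from 1).
COMPLEMENTS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def B(K, i, j):
    """Score for matching sites i and j of K: 1 if complementary, else 0."""
    return 1 if (K[i], K[j]) in COMPLEMENTS else 0
```

For K = "AUCG", sites 0 and 1 (A, U) and sites 2 and 3 (C, G) score 1, while sites 0 and 2 (A, C) score 0.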

9

Introduction to the original O(n3) DP

Let S(i,j) represent the score for the optimal folding of the subsequence consisting of the sites in K from i to j>i inclusively.

The DP recurrence relations are based on the different possibilities of pairing nucleotides.

10

S(i,j) will equal the maximum of:

S(i+1,j-1) + B(i,j)

S(i,j-1)

11

S(i+1,j)

S(i,k-1) + S(k,j) for a split point k, where S(i,k-1) is the Head and S(k,j) is the Tail

12

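As a result, the recurrences for the optimal fold score S(i,j) are:

S(i,j) = max of
    S(i+1,j-1) + B(i,j)                         Rule a
    S(i,j-1)                                    Rule b
    S(i+1,j)                                    Rule c
    max over i < k < j of S(i,k-1) + S(k,j)     Rule d (Head + Tail)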
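The four cases above can be combined into the usual O(n^3) dynamic program. The following is a minimal sketch, assuming 0-based indices, the simple 0/1 scoring scheme, and evaluation in order of increasing subsequence length; the function name `fold_score` is hypothetical.

```python
def fold_score(K):
    """Maximum number of non-crossing complementary pairs in K (cubic-time DP)."""
    pairs = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    n = len(K)
    if n == 0:
        return 0
    # S[i][j] for i > j is never written and stays 0 (empty subsequence).
    S = [[0] * n for _ in range(n)]
    for length in range(1, n):              # distance between i and j
        for i in range(n - length):
            j = i + length
            match = 1 if (K[i], K[j]) in pairs else 0
            best = max(S[i + 1][j - 1] + match,   # i pairs with j
                       S[i][j - 1],               # j is unpaired
                       S[i + 1][j])               # i is unpaired
            for k in range(i + 1, j):             # Head S(i,k-1) + Tail S(k,j)
                best = max(best, S[i][k - 1] + S[k][j])
            S[i][j] = best
    return S[0][n - 1]
```

The innermost loop over k is the part that the rest of the talk speeds up.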

13

A U C G G C A U C A A

i \ j  0  1       2       3       4       5       6       ...  11       12
0         S[0,1]  S[0,2]  S[0,3]  S[0,4]  S[0,5]  S[0,6]  ...  S[0,11]
1                 S[1,2]  S[1,3]  S[1,4]  S[1,5]  S[1,6]  ...  S[1,11]
2                         S[2,3]  S[2,4]  S[2,5]  S[2,6]  ...  S[2,11]
3                                 S[3,4]  S[3,5]  S[3,6]  ...  S[3,11]
4                                         S[4,5]  S[4,6]  ...  S[4,11]
5                                                 S[5,6]  ...  S[5,11]
6                                                         ...  S[6,11]
7                                                              S[7,11]
8                                                              S[8,11]
9                                                              S[9,11]
...

Usually evaluated in order of increasing subsequence length, i.e. increasing distance between i and j.

14


i \ j  0  1       2       3       4       5       6       ...  11       12
0         S[0,1]  S[0,2]  S[0,3]  S[0,4]  S[0,5]  S[0,6]  ...  S[0,11]
1                 S[1,2]  S[1,3]  S[1,4]  S[1,5]  S[1,6]  ...  S[1,11]
2                         S[2,3]  S[2,4]  S[2,5]  S[2,6]  ...  S[2,11]
3                                 S[3,4]  S[3,5]  S[3,6]  ...  S[3,11]
4                                         S[4,5]  S[4,6]  ...  S[4,11]
5                                                 S[5,6]  ...  S[5,11]
6                                                         ...  S[6,11]
7                                                              S[7,11]
8                                                              S[8,11]
9                                                              S[9,11]
...

-We choose an alternative permissible order.

-The algorithm will evaluate in order of increasing j and decreasing i.

15

for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1)+B(i,j),      Rule a
                      S(i,j-1) )               Rule b
    for i = j-1 to 1
        S(i,j) = max( S(i,j), S(i+1,j) )       Rule c
        for k = j-1 to i+1 {Rule d loop}
            S(i,j) = max( S(i,j), S(i,k-1)+S(k,j) )      Rule d
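The column-by-column order (increasing j, decreasing i) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: indices are 0-based rather than 1-based as on the slide, the simple 0/1 scoring is hard-coded, and the function name is hypothetical.

```python
def fold_by_columns(K):
    """Fill S one column j at a time; within a column, rows i go from j-1 down to 0."""
    pairs = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    n = len(K)
    S = [[0] * n for _ in range(n)]     # entries with i > j stay 0
    for j in range(1, n):
        for i in range(j):              # Rules a and b: only column j-1 is needed
            match = 1 if (K[i], K[j]) in pairs else 0
            S[i][j] = max(S[i + 1][j - 1] + match, S[i][j - 1])
        for i in range(j - 1, -1, -1):  # Rules c and d: rows below i in column j are done
            S[i][j] = max(S[i][j], S[i + 1][j])
            for k in range(j - 1, i, -1):          # Rule d loop
                S[i][j] = max(S[i][j], S[i][k - 1] + S[k][j])
    return S
```

The optimal score for the whole sequence is then `S[0][len(K) - 1]`. Because every S(k,j) with k > i is already final when row i is processed, the Rule d loop only reads finished entries.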

16

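for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1)+B(i,j),      Rule a
                      S(i,j-1) )               Rule b
    for i = j-1 to 1
        S(i,j) = max( S(i,j), S(i+1,j) )       Rule c
        for k = j-1 to i+1 {Rule d loop}
            S(i,j) = max( S(i,j), S(i,k-1)+S(k,j) )      Rule d

Rules a and b: O(n^2) time in total.

Rule d loop: O(n) time for each of the O(n^2) pairs (i,j), so O(n^3) in total.

Bottleneck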

17

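for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1)+B(i,j),      Rule a
                      S(i,j-1) )               Rule b
    for i = j-1 to 1
        S(i,j) = max( S(i,j), S(i+1,j) )       Rule c
        for k = j-1 to i+1 {Rule d loop}
            S(i,j) = max( S(i,j), S(i,k-1)+S(k,j) )      Rule d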

Idea: Change the decrement by one to a constant number of operations in each group of size q.

Four Russians speedup: first hint

In the first for loop (Rules a and b), either i and j pair, or j is not involved in a pair and we take the score for the optimal matching of the nucleotides from i to j-1. There is no dependency on an optimal fold that involves the jth nucleotide. Cases c and d do involve the jth nucleotide, but because we move along the ith ...

What will Rule d loop look like then?

19

Computation components for speeding up Rule d loop

There are several components for speeding up the Rule d loop.

• Rgroup

• Vg vector

• Little vg vector

• k*

• R Table

At first these may seem unclear and extraneous but will lead to a speed-up.

20

Rgroup g

Conceptually divide each column in the S matrix into groups of q rows. Each is called an Rgroup, indexed by g.

21


i \ j  0  1       2       3       4       5       6       ...  11       12
0         S[0,1]  S[0,2]  S[0,3]  S[0,4]  S[0,5]  S[0,6]  ...  S[0,11]
1                 S[1,2]  S[1,3]  S[1,4]  S[1,5]  S[1,6]  ...  S[1,11]
2                         S[2,3]  S[2,4]  S[2,5]  S[2,6]  ...  S[2,11]
3                                 S[3,4]  S[3,5]  S[3,6]  ...  S[3,11]
4                                         S[4,5]  S[4,6]  ...  S[4,11]
5                                                 S[5,6]  ...  S[5,11]
6                                                         ...  S[6,11]
7                                                              S[7,11]
8                                                              S[8,11]
9                                                              S[9,11]
...

Rgroup example where q=3: the rows of each column are divided into groups of 3 (Rgroup 0, Rgroup 1, Rgroup 2).

22


• Rgroup

• Vg vector

• Little vg vector

• k*

• R Table

23

Vg vector

For a fixed j, consider an Rgroup g consisting of rows z, z-1, ..., z-q+1 for some z<j.

Let Vg = {S(z,j), S(z-1,j), ..., S(z-q+1,j)} for all Rgroups g.

[Figure: vectors V0 and V1 marked in column j]

24


• Rgroup

• Vg vector


• k*

25

Little vg vector

Observe that for the simple scoring scheme:

B(i,j)=1 when (i,j) is a permitted match

B(i,j)=0 when (i,j) is not a permitted match

S(z-1,j) - S(z,j) is either 1 or 0.

[Figure: consecutive differences in V0 and V1, e.g. 5-4=1 and 5-5=0]

Hence, the differences between consecutive values in any Vg vector are either 0 or 1.

26

Little vg vector

The changes in consecutive values of Vg could therefore be encoded into little vg, a vector of length q-1.

[Figure: V0 encoded into v0 by consecutive differences (5-4=1, 5-5=0), with the base of V0 marked]

27

Little vector vg

[Figure: V1 encoded into v1 by consecutive differences (3-3=0, 4-3=1)]

- Encoding each Vg into little vg takes O(q) time.

- Each column has j/q different Rgroups.

- O(n) time to do encoding for the entire column.
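The encoding step can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, Vg is assumed to be given as a Python list in the slide's order (S(z,j) first, so the base comes first), and packing the bits into an integer is one possible way to use vg as a table index.

```python
def encode(V):
    """Little vg: differences between consecutive entries of Vg, each 0 or 1."""
    v = [V[l + 1] - V[l] for l in range(len(V) - 1)]
    if any(d not in (0, 1) for d in v):
        raise ValueError("consecutive values of Vg must differ by 0 or 1")
    return v

def as_index(v):
    """Pack the 0/1 vector into an integer, usable as an index into a table."""
    index = 0
    for bit in v:
        index = 2 * index + bit
    return index
```

For example, a group with values (4, 5, 5) encodes to (1, 0), and one with values (3, 3, 4) encodes to (0, 1). Each call does O(q) work.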

Decoding the offset: informal definition

Decode: vg -> V'

where

- V' is a vector of length q-1

- if V'[l]=4 then Vg[l]=x+4, where x is the value in Vg(0), the base of the Vg vector.

29

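Decode: formal definition and example

V' will be a running sum of the values in little vector vg.

decode: vg -> V', where V'[l] = vg[1] + vg[2] + ... + vg[l]

Example with q=7:

Vg = (4, 5, 5, 6, 7, 7, 8), with base 4

encode: vg = (5-4, 5-5, 6-5, 7-6, 7-7, 8-7) = (1, 0, 1, 1, 0, 1)

decode: V' = (1, 1+0, 1+1, 2+1, 3+0, 3+1) = (1, 1, 2, 3, 3, 4), the offsets from the base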
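Decoding as a running sum can be sketched in a few lines of Python. The function names are hypothetical, and the values used in the usage note are one reading of the q=7 example on this slide.

```python
from itertools import accumulate

def decode(v):
    """V': running sum of little vg, i.e. the offset of each entry from the base."""
    return list(accumulate(v))

def rebuild(base, v):
    """Recover the full Vg vector from its base and its little vg."""
    return [base] + [base + offset for offset in decode(v)]
```

With vg = (1, 0, 1, 1, 0, 1), `decode` gives the offsets (1, 1, 2, 3, 3, 4), and `rebuild(4, ...)` gives back (4, 5, 5, 6, 7, 7, 8). Note that decode never needs the base, which is why the table R can be indexed by vg alone.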

30


• Rgroup

• Vg vector


• k*

• Table R

31

Defining k* for an Rgroup g

In column j, let k*(i,g,j)

be the index k such that the sum

S(i,k-1)+S(k,j)

is maximized over the indices in Rgroup g.

32

Example

For example, assuming column j=11 and row i=3:

To directly compute k*(i,g,j) for Rgroup 1 (g=1) we would have to find the max of

S(3,4)+S(5,11), S(3,5)+S(6,11), S(3,6)+S(7,11)

and store the k that corresponds to that max bipartition.

This would require O(q) operations.
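The direct computation of k* can be sketched as follows. This is only an illustration: the function name is hypothetical, S is assumed to be a dictionary keyed by (row, column), and the S values in the usage note are made up for the example, not taken from the slides.

```python
def k_star_direct(S, i, j, rows):
    """Return the k among `rows` that maximizes S(i,k-1) + S(k,j), in O(q) time."""
    return max(rows, key=lambda k: S[(i, k - 1)] + S[(k, j)])

# Hypothetical values for row i=3 and column j=11, with the group k = 5, 6, 7.
S_example = {
    (3, 4): 0, (3, 5): 1, (3, 6): 1,      # heads S(3,k-1)
    (5, 11): 3, (6, 11): 3, (7, 11): 2,   # tails S(k,11)
}
```

With these values the three sums are 3, 4 and 3, so `k_star_direct(S_example, 3, 11, [5, 6, 7])` picks k=6.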

33


• Rgroup

• Vg vector


• k*

• Table R

Table R

Introducing Table R (currently a black box) that, given (i,g,vg), returns k*(i,g,j) in O(1) operations.

35

Modified Rule d loop using Table R

for j = 2 to n
    ….
    for i = j-1 to 1
        for g = (j-1)/q to (i+1)/q {Rule d loop, replacing: for k = j-1 to i+1}
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1)+S(k*,j) )

36

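Run-time of Modified Rule d loop using Table R

for j = 2 to n
    ….
    for i = j-1 to 1                                   O(n)
        for g = (j-1)/q to (i+1)/q                     O(n/q)
            {Rule d loop: where Rgroup g is fully complete}
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1)+S(k*,j) )

The asymptotic run-time for the Rule d loop is O(n^2/q) for each column j, which gives O(n^3/q) over all n columns.

Rgroups that are not fully complete are handled directly: O(q*n^2) in total.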

37

Modified Rule d loop using Table R

for j = 2 to n
    ….
    for i = j-1 to 1
        for g = (j-1)/q to (i+1)/q {where Rgroup g is fully complete}
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1)+S(k*,j) )

Having k* speeds up the Rule d loop. How can we know k* when we need it?

Answer: Four-Russians preprocessing.

38

Preprocessing

39

Preprocessing Table R: Cgroups

We conceptually divide the columns into groups of size q. Example with q=3.

40

Preprocessing Table R

Assume that we run the 'Second Dynamic Programming Algorithm' until j=q. That means that all the S(i,j) values in Cgroup 0 have been computed.

At this point we can compute the following:

for each binary vector v of length q-1
    V' = decode(v)
    for each i such that i < q-1
        R(i, 0, v) is set to the index k in Rgroup 0 such that S(i,k-1) + V'[k] is maximized.

41

Preprocessing in general

In general, for Cgroup g > 0, we could do a similar preprocessing after all the entries in columns of Cgroup g have been computed.

That is, R(i,g,v) could be found and stored for all i < g*q.

once Cgroup g=(j/q) is complete
    for each binary vector v of length q-1
        V' = decode(v)
        for each i such that i < g*q
            R(i, g, v) is set to the index k in Rgroup g such that S(i,k-1) + V'[k] is maximized.

This takes O(q * n * 2^(q-1)) time per Cgroup.

Total for pre-processing for all Cgroups is O(q * n * 2^(q-1)) * O(n/q) = O(n^2 * 2^(q-1)) time.

42

RNA folding algorithm with Four-Russians Speedup

43

RNA folding algorithm with Four-Russians Speedup

for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1)+B(i,j) , S(i,j-1) )
    for i = j-1 to 1
        for g = (j-1)/q to (i+1)/q {Rule d loop}                      O(n^3/q)
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1)+S(k*,j) )
    once Cgroup g=(j/q) is complete                                   O(n^2 * 2^(q-1))
        for each binary vector v of length q-1
            V' = decode(v)
            for each i such that i < g*q
                R(i, g, v) is set to the index k in Rgroup g such that S(i,k-1) + V'[k] is maximized.
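The pieces can be put together as in the following sketch. It is one possible reading of the pseudocode, not the authors' implementation, and it makes several assumptions: indices are 0-based, Rgroup g is taken to be rows g*q to g*q+q-1, the table for a group is built when column j reaches the group's last row, vg is packed into an integer, and split points in groups that are not complete are tried directly. All function names are hypothetical; `fold_cubic` is included only as a reference for checking.

```python
def pair(a, b):
    return 1 if (a, b) in {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")} else 0

def fold_cubic(K):
    """Reference O(n^3) DP, used only to check the sketch below."""
    n = len(K)
    S = [[0] * n for _ in range(n)]
    for j in range(1, n):
        for i in range(j - 1, -1, -1):
            best = max(S[i + 1][j - 1] + pair(K[i], K[j]), S[i][j - 1], S[i + 1][j])
            for k in range(i + 1, j):
                best = max(best, S[i][k - 1] + S[k][j])
            S[i][j] = best
    return S[0][n - 1] if n else 0

def build_table(S, a, q):
    """Table R for the group with rows a..a+q-1: table[i][v] = k* for each row i < a."""
    b = a + q - 1
    table = []
    for i in range(a):
        row = []
        for v in range(1 << (q - 1)):
            best_k, best, off = b, S[i][b - 1], 0     # row b is the base (offset 0)
            for t in range(q - 1):                    # bit t of v is S[b-1-t][j] - S[b-t][j]
                off += (v >> t) & 1
                k = b - 1 - t
                if S[i][k - 1] + off > best:
                    best_k, best = k, S[i][k - 1] + off
            row.append(best_k)
        table.append(row)
    return table

def fold_four_russians(K, q=3):
    n = len(K)
    S = [[0] * n for _ in range(n)]
    R = {}
    for j in range(1, n):
        if (j + 1) % q == 0:                          # columns up to j-1 are complete
            R[j // q] = build_table(S, (j // q) * q, q)
        vec = {}                                      # encoded vg per group, for this column
        for i in range(j - 1, -1, -1):
            best = max(S[i + 1][j - 1] + pair(K[i], K[j]), S[i][j - 1], S[i + 1][j])
            first_full = i // q + 1                   # first group lying entirely below row i
            last_full = (j + 1) // q - 1              # last group whose rows are all <= j
            direct = list(range(i + 1, min(first_full * q, j)))
            direct += range(max((last_full + 1) * q, first_full * q), j)
            for k in direct:                          # O(q) split points outside full groups
                best = max(best, S[i][k - 1] + S[k][j])
            for g in range(first_full, last_full + 1):
                if g not in vec:                      # encode once per group and column
                    b = g * q + q - 1
                    vec[g] = sum((S[b - 1 - t][j] - S[b - t][j]) << t for t in range(q - 1))
                k = R[g][i][vec[g]]                   # O(1) lookup of k*
                best = max(best, S[i][k - 1] + S[k][j])
            S[i][j] = best
    return S[0][n - 1] if n else 0
```

A quick way to gain confidence in a sketch like this is to compare `fold_four_russians(K, q)` with `fold_cubic(K)` on many sequences and several values of q. The tables here are kept for every group, so memory grows with n * (n/q) * 2^(q-1); that is acceptable for illustration but would need attention in practice.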

44

Run-Time

Setting q = log(n), the total run-time of

O(n^2 * 2^(q-1)) [pre-processing] + O(n^3/q) + O(n^2 * q) [computation]

simplifies to O(n^3/log(n)) time in total.
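The three terms can be compared numerically with a small sketch. The slide does not state the base of the logarithm; the code below assumes base 3, which makes 2^(q-1) grow like n^0.63 so that the pre-processing term stays below n^3/q. With base 2 the first term would itself be n^3/2, so the choice of base matters. The function name and dictionary keys are hypothetical.

```python
import math

def cost_terms(n, base=3):
    """Evaluate the three run-time terms for q = log_base(n), rounded to an integer."""
    q = max(2, round(math.log(n, base)))
    return {
        "q": q,
        "preprocessing": n**2 * 2**(q - 1),   # building Table R for all Cgroups
        "rule_d": n**3 / q,                   # modified Rule d loop
        "other": n**2 * q,                    # remaining O(n^2 * q) work
    }
```

For n = 3^8 = 6561 this gives q = 8, a pre-processing term of 128 * n^2 and a Rule d term of about 820 * n^2, so the n^3/q term dominates.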

45

Empirical Results

- We compare our Four-Russians algorithm to the original O(n^3)-time algorithm.

- The empirical results shown in Table 1 give the average time for 50 tests of randomly generated RNA sequences and 25 sequences downloaded from GenBank, for each size between 1,000bp and 6,000bp.

- The algorithm performs identically for randomly generated and GenBank sequences of equal length.

- This is to be expected because the algorithm's run time does not depend on the characters of the sequence.

46

[Table 1: average running times, in seconds]

47

Variations on scoring scheme B

The Four-Russians speedup could be extended to any B(i,j) for which no differences between S(i,j) and S(i+1,j) depend on n.

Let C denote the size of the set of all possible differences. Then the asymptotic time is:

[Equation not recovered from the slide]

48

Parallel Computing

When running this algorithm in parallel on n processors, we can reduce the total computation time to O(n^2 * log(n)).

(This could be reduced to O(n^2) time by a re-ordering method, at a trade-off in space.)

49

Conclusion

- A practical and easy-to-implement Four-Russians speedup algorithm for prediction of RNA secondary structure, which could be applied to a wide variety of scoring schemes.

- The asymptotic time of O(n^3/log(n)) is achieved by interleaving computation and preprocessing.

- This time can be further lowered to O(n^2 * log(n)) when using n processors in parallel.
