yelena frid and dan gusfield department of computer science uc davis

49
1 A simple, practical and complete O(n 3 /log(n))-time Algorithm for RNA folding using the Four-Russians Speedup Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

Upload: miller

Post on 14-Jan-2016

35 views

Category:

Documents


2 download

DESCRIPTION

A simple, practical and complete O( n 3 /log(n))-time Algorithm for RNA folding using the Four-Russians Speedup. Yelena Frid and Dan Gusfield Department of Computer Science UC Davis. Background. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

1

A simple, practical and complete O(n3/log(n))-time Algorithm

for RNA foldingusing the Four-Russians Speedup

Yelena Frid and Dan GusfieldDepartment of Computer Science UC Davis

Page 2: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

2

Background

-Problem of computationally predicting the secondary structure of RNA molecules was first introduced more than thirty years ago by:

Nussinov et. Al. (1978, 1980)Waterman and Smith (1978)

Zuker M. and Stiegler (1981)

They presented Dynamic programming solutions with an asymptotic runtime of O(n3).

Where n is the length of the RNA sequence.

Page 3: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

3

There have been several improvements to the running time of this problem:

• A complex worse case speed up O(n3*(loglogn)/(logn) 1/2)– (Akutsu 1999)

• Practical heuristic speed up O(nP) where P in [n, n2] (Wexler 2007, Backofen, 2009). This is still O(n3) in terms of n. It is important to note

that P is empirically much less then n2.

Our approach has an O(n3/log(n)) running time using the FOUR RUSSIANS speedup.

Page 4: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

4

Four Russians is well known

- The Four Russians method is a general technique for speeding up Dynamic programs.

- The method involves extensive preprocessing of a wide-range of possible inputs, before any actual input is seen.

However, it has not been applied to RNA folding, despite an ‘expectation’ that it could be.

Page 5: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

5

Possible reasons for difficulty in applying the FOUR RUSSIANS speedupI. Widely exposed version of the original dynamic-

programming algorithm does not lend itself to application of the Four-Russians technique. (Solution: choose a different order of evaluation.)

II. It doesn’t seem possible to separate the preprocessing and the computation. (Solution: interleave the two.)

We made use of the insights presented in Graham et. al. (1980) paper which gives a Four-Russians solution to the problem of Context-Free Language recognition, where similar problems were encountered.

Page 6: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

6

Basic RNA-folding problem

- Find a maximum number of non-crossing matches of complimentary nucleotides in an RNA sequence of length n.

Enhancements:

-Richer scoring schemes

-Minimum energy calculations DP.

Page 7: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

7

Input to the basic RNA-folding problem consists of a string K of length n over a four-letter alphabet {A,U,C,G}

A matching consists of non-crossing disjoint pairs of sites in K.

ii i’ j j’

Page 8: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

B(i,j)

B(i,j) is the score given to the match between nucleotides in sites i and j of K.

In a simple scoring scheme a score of 1 is given for complementary pairs ( A, U or C, G), otherwise 0.

Page 9: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

9

Introduction to the original O(n3) DP

Let S(i,j) represent the score for the optimal folding of the subsequence consisting of the sites in K from i to j>i inclusively.

The DP recurrence relations are based on the different possibilities of pairing nucleotides.

Page 10: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

10

S(i,j) will equal the maximum of :

S(i+1,j-1)+B(i,j) S(i, j-1)

Page 11: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

11

Head Tail

S(i+1, j)

Page 12: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

12

As a result the recurrences for the optimal fold score S(i,j) are:

Head Tail

Page 13: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

13

A U C G G C A U C A A

0 1 2 3 4 5 6 11 12

0 S[0,1] S[0,2] S[0,3] S[0,4] S[0,5] S[0,6] S[0,11]

1 S[1,2] S[1,3] S[0,4] S[0,5] S[1,6] S[1,11]

2 S[2,3] S[2,4] S[2,5] S[2,6] S[2,11]

3 S[3,4] S[3,5] S[3,6] S[3,11]

4 S[4,5] S[4,6] S[4,11]

5 S[5,6] S[5,11]

6 S[6,11]

7 S[7,11]

8 S[8,11]

9 S[9,11]

j

……iUsually evaluated

in increasing length of subsequence i.e. distance between i and j.

Page 14: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

14

A U C G G C A U C A A

0 1 2 3 4 5 6 11 12

0 S[0,1] S[0,2] S[0,3] S[0,4] S[0,5] S[0,6] S[0,11]

1 S[1,2] S[1,3] S[0,4] S[0,5] S[1,6] S[1,11]

2 S[2,3] S[2,4] S[2,5] S[2,6] S[2,11]

3 S[3,4] S[3,5] S[3,6] S[3,11]

4 S[4,5] S[4,6] S[4,11]

5 S[5,6] S[5,11]

6 S[6,11]

7 S[7,11]

8 S[8,11]

9 S[9,11]

j

……i

-We choose an alternative permissible order.

-The algorithm will evaluate in order of increasing j and decreasing i.

Page 15: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

15

for j= 2 to n

for i= 1 to j-1 S(i+1,j-1)+B(i,j) Rule a

S(i,j-1) Rule b

for i = j-1 to 1

S(i,j) = max( S(i,j), S(i+1,j) ) Rule c

for k=j-1 to i+1 {Rule d loop}

S(i,j)=max( S(i,j) , S(i,k-1)+S(k,j) ) Rule d

S(i,j)=maxO(n2)

Page 16: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

16

for j= 2 to n

for i= 1 to j-1 S(i+1,j-1)+B(i,j) Rule a

S(i,j-1) Rule b

for i = j-1 to 1

S(i,j) = max( S(i,j), S(i+1,j) ) Rule c

for k=j-1 to i+1 {Rule d loop}

S(i,j)=max( S(i,j) , S(i,k-1)+S(k,j) )

S(i,j)=maxO(n)

O(n2)

O(n)

O(n3)

Bottleneck

Page 17: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

17

for j= 2 to n

for i= 1 to j-1 S(i+1,j-1)+B(i,j) Rule a

S(i,j-1) Rule b

for i = j-1 to 1

S(i,j) = max( S(i,j), S(i+1,j) ) Rule c

for k=j-1 to i+1 {Rule d loop}

S(i,j)=max( S(i,j) , S(i,k-1)+S(k,j) )

S(i,j)=max

Idea: Change the decrement by one to a constant number of operations in each group of size q.

Four Russians speedup: first hint

Boria
You can see that for the first for loop or Rule a, bi.e. where i,j pair or j is not involved in a pair and we take the score for the optimal matching of the nucleotides from i to j-1. There is no dependency on an optimal fold that invoves the jth nucleotide.Now case c and d do envolve the jth nucleotide but because we move a long the ith
Page 18: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

What will Rule d loop look like then?

Page 19: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

19

Computation components for speeding up Rule d loop

There are several components for speeding up Rule d loop.• Rgroup• Vg vector• Little vg vector• k*• R Table

At first these may seem unclear and extraneous but will lead to a speed-up.

Page 20: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

20

Rgroup g

Conceptually divide each column in the S matrix into groups of q rows. Each is called an Rgroup, indexed by g.

Page 21: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

21

A U C G G C A U C A A

0 1 2 3 4 5 6 11 12

0 S[0,1] S[0,2] S[0,3] S[0,4] S[0,5] S[0,6] S[0,11]

1 S[1,2] S[1,3] S[1,4] S[1,5] S[1,6] S[1,11]

2 S[2,3] S[2,4] S[2,5] S[2,6] S[2,11]

3 S[3,4] S[3,5] S[3,6] S[3,11]

4 S[4,5] S[4,6] S[4,11]

5 S[5,6] S[5,11]

6 S[6,11]

7 S[7,11]

8 S[8,11]

9 S[9,11]

j

……i

Rgroup example where q=3

Rgroup 0

Rgroup 1

Rgroup 2

Page 22: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

22

Computation components for speeding up Rule d loop

• Rgroup

• Vg vector

• Little vg vector

• k*

• R Table

Page 23: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

23

Vg vectorFor a fixed j, consider an Rgroup g consisting of rows

z, z-1, … z-q+1 for some z<j.

Let Vg = {S(z,j), S(z-1, j) ,… S(z-q+1,j)} for all Rgroups g.

V0

V1

Page 24: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

24

Computation components for speeding up Rule d loop

• Rgroup

• Vg vector

• Little vg vector

• k*

Page 25: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

25

Little vg vector

Observe that for the simple scoring scheme:

B(i,j)=1 when (i,j) is a permitted matchB(i,j)=0 when (i,j) is not a permitted match

S(z-1,j) – S(z,j) = {1 or 0}.

V0V15-4=1

5-5=0

Hence, the difference between consecutive values in any Vg vector are either 0 or 1.

Page 26: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

26

Little vg vector

The changes in consecutive values of Vg could therefore be encoded into little vg ,a vector of length q-1.

V0V1

encode*

5-4=1

5-5=0 v0

Base of V0

Page 27: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

27

Little vector vg

V1

V0

encode

3-3=0

4-3=1 v1

- Encoding each Vg into little vg takes O(q) time.

- Each column has j/q different Rgroups.

- O(n) time to do encoding for the entire column.

Page 28: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

Decoding the offset- non formal definition

Decode: vg V`;

where

- V’ is a vector of length q-1

- if V`[l]=4 then Vg[l]=x+4 where x is the value in Vg(0) or the base of the Vg vector.

Page 29: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

29

decode formal definition and exampleIt will be a running sum of the values in little vector vg.

decode: vg ->V’ ; where Example q=7

Vgencode

7-7=0

8-7=1

6-5=1

7-6=1

5-4=1

5-5=0

vg

3+0=3

3+1=4

1+1=2

2+1=3

1

1+0=1

V’decode

Offset from the base

base

Page 30: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

30

Computation components for speeding up Rule d loop

• Rgroup

• Vg vector

• Little vg vector

• k*

• Table R

Page 31: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

31

Defining k* for an Rgroup g

In column j, let k*(i,g,j)

be the index k such that the sum

S(i,k-1)+S(k,j)

is maximized over the indices in Rgroup g.

Page 32: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

32

Example

S(3,4)+S(5,11)S(3,5)+S(6,11)S(3,6)+S(7,11)

For example, assuming column j=11 and row i=3:

To directly compute k*(i,g,j) for Rgroup1 (g=1) we would have to find max of

and store the k that corresponds to that max bipartition.

This would require O(q) time operations.

Page 33: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

33

Computation components for speeding up Rule d loop

• Rgroup

• Vg vector

• Little vg vector

• k*

• Table R

Page 34: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

Table R

Introducing Table R (currently a black box) that given (i,g,vg)

returns k*(i,g,j)

in O(1) operations.

Page 35: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

35

for j= 2 to n

….

for i = j-1 to 1

for k=j-1 to i+1 get vg given Rgroup g

retrieve k*(i,g,j) from R(i,g,vg)

S(i,j)=max( S(i,j) , S(i,k*-1)+S(k*,j) )

Modified Rule d loop using Table R

for g=(j-1)/q to (i+1)/q {Rule d loop}

Page 36: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

36

for j= 2 to n

….

for i = j-1 to 1for g=(j-1)/q to (i+1)/q {Rule d loop: where Rgroup g is fully

complete}

get vg given Rgroup g

retrieve k*(i,g,j) from R(i,g,vg)

S(i,j)=max( S(i,j) , S(i,k*-1)+S(k*,j) )

Run-time of Modified Rule d loop using Table R

The asymptotic run-time for Rule d loop is O(n2/q)

O(n)

O(n/q)

q*n2

Page 37: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

37

for j= 2 to n

….

for i = j-1 to 1 for g=(j-1)/q to (i+1)/q { where Rgroup g is fully complete}

get vg given Rgroup g

retrieve k*(i,g,j) from R(i,g,vg)

S(i,j)=max( S(i,j) , S(i,k*-1)+S(k*,j) )

Modified Rule d loop using Table R Having k* speeds up the

Rule d loop. How can we know k* when we need it?Answer: Four-Russians preprocessing.

Page 38: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

38

Preprocessing

Page 39: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

39

Preprocessing Table R: Cgroups

We conceptually divide the

columns into groups of size q. Example with q=3.

Page 40: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

40

Preprocessing Table RAssume that we run the ‘Second Dynamic Programming Algorithm’ until j=q. That means that all the S(i,j) values in Cgroup 0 have been computed.

At this point we can compute the following:

for each binary vector v of length q-1 V’=decode(v) for each i such that i < q-1 R(i, 0, v) is set to the index k in Rgroup 0

such that S(i,k-1) + V’[k] is maximized.

Page 41: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

41

Preprocessing in general

In general, for Cgroup g > 0, we could do a similar preprocessing after all the entries in columns of Cgroup g have been computed.

That is, R(i,g,v) could be found and stored for all i < g*q.

once Cgroup g=(j/q) is completefor each binary vector v of length q-1 V’=decode(v)

for each i such that i < q-1 R(i, g, v) is set to the index k in Rgroup g such that

S(i,k-1) + V’[k] is maximized.

O(qn 2q-1)

Total for pre-processing for all Cgroups is O(qn*2q-1) *

O(n/q)=O(n2*2q-1) time

Page 42: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

42

RNA folding algorithm with Four-Russians Speedup

Page 43: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

43

RNA folding algorithm with Four-Russians Speedup for j= 2 to n for i= 1 to j-1

S(i,j) = max( S(i+1,j-1)+B(i,j) , S(i,j-1))

for i = j-1 to 1 for g=(j-1)/q to (i+1)/q {Rule loop d}

get vg given Rgroup g retrieve k*(i,g,j) from R(i,g,vg)

S(i,j)=max( S(i,j) , S(i,k*-1)+S(k*,j) )

once Cgroup g=(j/q) is complete for each binary vector v of length q-1 V’=decode(v)

for each i such that i < q-1 R(i, g, v) is set to the index k in Rgroup g such that S(i,k-1) + V’[k] is maximized.

O(n2*2q-1)

O(n3/q)

Page 44: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

44

Run-Time

Setting q= log(n) the total run-time of

O(n2*2q-1) + O(n3/q)+ O(n2q) simplifies to

O(n3/log(n)) time total. Pre-processing Computation

Page 45: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

45

Empirical Results

- We compare our Four-Russians algorithm to the original O(n3)-time algorithm.

- The empirical results shown in Table 1 give the average time for 50 tests of randomly generated RNA sequences and 25 downloaded sequences from genBank, for each size between 1,000bp and 6,000bp.

- The algorithm performs identically for randomly generated and gen-Bank sequences of equal length.

- This is to be expected because the algorithm's run time is sequence character independent.

Page 46: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

46

*

*seconds

Page 47: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

47

Variations on scoring scheme B

The Four Russians Speed-Up could be extended to any B(i,j) for which no differences between S(i,j) and S(i+1,j) depend on n.

Let C denote the size of the set of all possible differences. Then asymptotic time is:

when and

Page 48: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

48

Parallel Computing

When running this algorithm in parallel on n processors we can reduce the total computation to O(log(n)*n2).

(Could be reduced to O(n2) time by a re-ordering method; space trade off)

Page 49: Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

49

Conclusion

- A practical and easy to implement Four-Russians Speed Up algorithm for prediction of RNA secondary structure, could be applied to a wide variety of scoring schemes.

-The asymptotic time of O(n3/log(n)) is achieved by interleaving computation and preprocessing.

-This time can be further lowered to O(log(n)*n2) when using n processors in parallel.