
Cache-Oblivious Dynamic Programming for Bioinformatics

Rezaul Alam Chowdhury, Hai-Son Le, and Vijaya Ramachandran

Abstract—We present efficient cache-oblivious algorithms for some well-studied string problems in bioinformatics, including the longest common subsequence, global pairwise sequence alignment and three-way sequence alignment (or median), both with affine gap costs, and RNA secondary structure prediction with simple pseudoknots. For each of these problems, we present cache-oblivious algorithms that match the best-known time complexity, match or improve the best-known space complexity, and improve significantly over the cache-efficiency of earlier algorithms. We present experimental results which show that our cache-oblivious algorithms run faster than software implementations based on the previous best algorithms for these problems.

Index Terms—Sequence alignment, median, RNA secondary structure prediction, dynamic programming, cache-efficient, cache-oblivious.


1 INTRODUCTION

ALGORITHMS for sequence alignment and for RNA secondary structure prediction are some of the most widely studied and widely used methods in bioinformatics. Many of these are dynamic programming algorithms that run in polynomial time under the traditional von Neumann model of computation, which assumes a single layer of memory with uniform access cost, and many have been further improved in their space usage, mainly using a technique due to Hirschberg [23]. Modern computers, however, differ significantly from the original von Neumann architecture. Unlike von Neumann machines, memory on these machines is typically organized in a hierarchy with registers in the lowest level followed by several levels of caches (L1, L2, and possibly L3), RAM, and disk. The access time and size of each level increase with its depth, and data is transferred in blocks between adjacent levels. When executed on such a typical modern computer, algorithms designed for the traditional model often cause the processor to stall while waiting for data to be transferred from slower levels of memory. The situation is worst for large datasets that involve block transfers to and from the disk. Therefore, in order to perform well on these machines, new algorithms must be designed that reduce the number of block transfers between different levels of the memory hierarchy.

Cache-efficiency and cache-oblivious algorithms. The two-level I/O model [1] is a simple abstraction of the memory hierarchy that consists of a cache of size M, and an arbitrarily large main memory partitioned into blocks of size B. An algorithm is said to have caused a cache miss if it references a block that does not reside in the cache and must be fetched from the main memory. The cache complexity (or I/O complexity) of an algorithm is measured in terms of the number of cache misses it incurs and thus the number of block transfers or I/O operations it causes. Algorithms designed for this model often crucially depend on the knowledge of M and B, and thus do not adapt well when these parameters change.

The ideal-cache model [18] is an extension of the two-level I/O model with the additional requirement that algorithms must remain oblivious of cache parameters, i.e., cache-oblivious. The model assumes an optimal offline cache replacement policy, which can be approximated to within a constant factor by standard cache replacement methods such as LRU and FIFO. A well-designed cache-oblivious algorithm is flexible and portable, and simultaneously adapts to all levels of a multilevel memory hierarchy.

1.1 Our Results

In this paper, we present an efficient cache-oblivious framework that solves a general class of recurrence relations in two and three dimensions that are amenable to solution by dynamic programs with “local dependencies.” In a dynamic program with local dependencies, the value of each cell in the DP table depends only on values in adjacent cells. In principle, our framework can be generalized to any number of dimensions, although we study explicitly only the 2D and 3D cases. We describe our methodology using the simple and well-known longest common subsequence (LCS) problem. We generalize this framework to develop cache-oblivious algorithms for several well-known string problems in bioinformatics, and show that our algorithms are both theoretically and experimentally more efficient

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 7, NO. 3, JULY-SEPTEMBER 2010 495

. R.A. Chowdhury is with the Center for Computational Visualization, Institute for Computational Engineering and Sciences, University of Texas at Austin, 1 University Station C0200, Austin, TX 78712-0027. E-mail: [email protected].

. H.-S. Le is with the Department of Machine Learning, Wean Hall 4612, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3891. E-mail: [email protected].

. V. Ramachandran is with the Department of Computer Sciences, Taylor Hall 2.124, University of Texas at Austin, 1 University Station C0500, Austin, TX 78712-1188. E-mail: [email protected].

Manuscript received 16 Aug. 2007; revised 22 May 2008; accepted 3 Aug. 2008; published online 21 Aug. 2008. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-2007-08-0106. Digital Object Identifier no. 10.1109/TCBB.2008.94.

1545-5963/10/$26.00 © 2010 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM


than previous algorithms for these problems. Our results for the string problems we consider are the following (recall that B is the block transfer size, and M is the size of the cache):

. Global pairwise alignment with affine gap costs. On a pair of sequences of length n each, our cache-oblivious algorithm runs in O(n^2) time, uses O(n) space, and incurs O(n^2/(BM)) cache misses.

. Median (i.e., optimal alignment of three sequences) with affine gap costs. Our cache-oblivious algorithm runs in O(n^3) time and O(n^2) space, and incurs only O(n^3/(B√M)) cache misses on three sequences of length n each.

. RNA secondary structure prediction with simple pseudoknots. On an RNA sequence of length n, our cache-oblivious algorithm runs in O(n^4) time, uses O(n^2) space, and incurs O(n^4/(B√M)) cache misses.

Our cache-oblivious algorithms improve on the space usage of traditional dynamic programs for each of the problems we study, and match the space usage of Hirschberg's space-reduced version [23] of these traditional DPs. However, our space reduction is obtained through a divide-and-conquer strategy that is quite different from the method used in [23]. Hirschberg's approach, too, decomposes the problem into subproblems, but uses a method that involves the application of traditional iterative DP in both forward and backward directions. The application of iterative DP results in inefficient cache usage. Moreover, the use of both forward and backward DP, and particularly the need to combine the results obtained from them, sometimes complicates the implementation of Hirschberg's method (e.g., for multiple simultaneous recurrences or recurrences with multiple fields). In contrast, our algorithm always applies DP in one direction and is arguably simpler to implement. A method for applying Hirschberg's space-reduction using forward-only DP is given in [17], but it involves repeated linear scans and thus is not cache-efficient.

In our experimental study of the first two problems, we compare our implementations to publicly available software written by others, and for the last one, our comparison is to our implementation of the best previous algorithm (due to Akutsu [3]). In general, our cache-oblivious algorithms outperformed the other algorithms for all three problems.

In a related work, in [12], we present parallel cache-oblivious implementations of these algorithms for distributed and shared caches, and for multicores. Additionally, we present cache-oblivious sequential and parallel algorithms for solving several dynamic programming recurrences with “nonlocal dependencies,”1 including those for pairwise sequence alignment with general gap costs, and the basic RNA secondary structure prediction (without pseudoknots) problem in [10], [8], and [12]. These algorithms can be adapted to run efficiently and obliviously of machine parameters (i.e., cache parameters and number of cores) on multilevel memory hierarchies with multicores using the space-bounded scheduler we introduced in [13].

1.2 Organization of This Paper

In Section 2, we describe our cache-oblivious framework for solving dynamic programming problems with local dependencies. In Section 2.1, we use the simple and well-known 2D dynamic programming recurrence for finding an LCS to describe our methodology. In Section 2.2, we formulate the general d-dimensional framework. In Section 2.3, we establish its I/O lower bound. In Section 2.4, we apply this framework to obtain cache-oblivious algorithms for global pairwise sequence alignment, median, and RNA secondary structure prediction with simple pseudoknots. In Section 3, we present our experimental results on the three problems.

Preliminary versions of the LCS results in Section 2.1 and the I/O lower bound in Section 2.3 appeared in a conference [10].

2 CACHE-OBLIVIOUS DYNAMIC PROGRAMS WITH LOCAL DEPENDENCIES

2.1 The LCS DP

In this section, we describe our methodology using the simple and well-known dynamic programming recurrence for the LCS problem.

A sequence Z = z_1 z_2 ... z_k is called a subsequence of another sequence X = x_1 x_2 ... x_n if there exists a strictly increasing function f : [1, 2, ..., k] → [1, 2, ..., n] such that for all i ∈ [1, k], z_i = x_{f(i)}. In the LCS problem, we are given two input sequences, and we need to find a maximum-length subsequence common to both sequences.

Given two sequences X = x_1 x_2 ... x_n and Y = y_1 y_2 ... y_n (for simplicity, we assume equal-length sequences here), we define c[i, j] (0 ≤ i, j ≤ n) to be the length of an LCS of x_1 x_2 ... x_i and y_1 y_2 ... y_j. Then, c[n, n] is the length of an LCS of X and Y, and can be computed using the following recurrence relation (see, e.g., [14]):

c[i, j] = 0,                            if i = 0 or j = 0;
          c[i−1, j−1] + 1,              if i, j > 0 and x_i = y_j;        (2.1)
          max{c[i, j−1], c[i−1, j]},    if i, j > 0 and x_i ≠ y_j.
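As a concrete point of reference, recurrence (2.1) can be evaluated with the standard iterative DP; the sketch below (our illustration, not part of the paper) fills the full (n+1) × (m+1) table, which is exactly the quadratic-space behavior the cache-oblivious algorithm developed later avoids.

```python
def lcs_length(X, Y):
    """Evaluate recurrence (2.1) iteratively: c[i][j] is the LCS length
    of the prefixes X[:i] and Y[:j]."""
    n, m = len(X), len(Y)
    # c[0][j] = c[i][0] = 0 covers the base case of the recurrence.
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if X[i - 1] == Y[j - 1]:               # x_i = y_j
                c[i][j] = c[i - 1][j - 1] + 1
            else:                                  # x_i != y_j
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c[n][m]
```

For example, `lcs_length("ABCBDAB", "BDCABA")` returns 4.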

We can rewrite the recurrence above in the following form:

c[i, j] = h(⟨i, j⟩),                                              if i = 0 or j = 0;     (2.2)
          f(⟨i, j⟩, ⟨x_i, y_j⟩, c[i−1 : i, j−1 : j] \ c[i, j]),   otherwise;

where h(·) is an initialization function that always returns 0, and f(·, ·, ·) is the function that computes the value of each cell based on the values in adjacent cells as follows:

f(⟨i, j⟩, ⟨x_i, y_j⟩, c[i−1 : i, j−1 : j] \ c[i, j]) = c[i−1, j−1] + 1,               if x_i = y_j;
                                                       max{c[i, j−1], c[i−1, j]},     otherwise.

Function f uses exactly one cell from its third argument to compute the final value of c[i, j], and we call that specific


1. In a dynamic program with nonlocal dependencies, values in some or all cells in the DP table depend on values in nonadjacent cells.


cell the parent cell of c[i, j]. The traceback path from any cell c[i, j] is the path following the chain of parent cells through c that ends at some c[i', j'] with i' = 0 or j' = 0. An LCS can be extracted from the traceback path starting at c[n, n].

All computations above are performed in the domain of nonnegative integers (i.e., the set IN of natural numbers).
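To make the notion of a traceback path concrete, the following sketch (ours, not the paper's; it uses the full table rather than the cache-oblivious scheme developed below) follows the chain of parent cells from c[n, n] back to the boundary and emits an LCS.

```python
def lcs_via_traceback(X, Y):
    """Build the full DP table for recurrence (2.1), then follow the
    chain of parent cells from c[n][m] until some index reaches 0."""
    n, m = len(X), len(Y)
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if X[i - 1] == Y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    out = []
    i, j = n, m
    while i > 0 and j > 0:                # stop at the boundary i = 0 or j = 0
        if X[i - 1] == Y[j - 1]:          # parent cell is c[i-1][j-1]
            out.append(X[i - 1])
            i, j = i - 1, j - 1
        elif c[i][j - 1] >= c[i - 1][j]:  # parent cell is c[i][j-1]
            j -= 1
        else:                             # parent cell is c[i-1][j]
            i -= 1
    return "".join(reversed(out))
```

The number of symbols emitted always equals c[n][m], since every diagonal step along the path contributes exactly one symbol.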

Recurrence (2.2) gives the general form of a DP with local dependencies in two dimensions. It can be evaluated iteratively in O(n^2) time, O(n^2) space, and O(n^2/B) cache misses. It has been shown in [2], [24], and [32] that the LCS problem cannot be solved in o(n^2) time if the elementary comparison operation is of type “equal/unequal” and the alphabet size is unrestricted. However, if the alphabet size is fixed, the theoretically fastest known algorithm runs in O(n^2/log n) time and space [33], though this appears to be impractical to implement. Faster algorithms exist for different special cases of the problem [6]. If the alphabet size is unrestricted, this problem can be solved in O(hn^2/log n) time and space [15], where h is the entropy of the sequences.

In most applications, however, the quadratic space required by an LCS algorithm is a more constraining factor than its quadratic running time [22]. Fortunately, there are linear-space implementations [23], [29], [5] of the LCS recurrence, but the cache complexity remains Θ(n^2/B) and the running time roughly doubles. Hirschberg's space-reduction technique [23] for the LCS recurrence has become the most widely used method for reducing the space complexity of similar DP-based algorithms in computational biology [34], [36], [42], [26]. However, if a traceback path is not required, it is easy to reduce the space requirement of the iterative algorithm to O(n) even without using Hirschberg's technique (see, e.g., [14]), and its cache complexity can be improved to O(n^2/(BM)) using the cache-oblivious stencil-computation technique [19].

In the rest of this section, we present a cache-oblivious algorithm for solving recurrence (2.2) along with a traceback path in O(n^2) time, O(n) space, and O(n^2/(BM)) cache misses. It improves over the previous best cache-miss bound by at least a factor of M, and reduces the space requirement by a factor of n when compared with the traditional iterative solution.

Cache-oblivious algorithm for solving recurrence (2.2). Our algorithm COMPUTE-TRACEBACK-PATH works by decomposing the given matrix c[1 : n, 1 : n] into smaller submatrices, and then recursively extracting the fragments of the traceback path from them. For any such submatrix, one can recursively compute the entries on its output boundary (i.e., on its right and bottom boundaries) provided the entries on its input boundary (i.e., entries immediately outside of its left and top boundaries) are already known (Fig. 1). Since the submatrices share boundaries, once the output boundaries of all submatrices are computed, the problem of finding the traceback path through the entire matrix is reduced to the problem of recursively finding the fragments of the path through the submatrices. Though we compute all n^2 entries of c, at any stage of recursion we only need to save the entries on the boundaries of the submatrices, and thus we use only O(n) space. The divide-and-conquer strategy also improves locality of computation and consequently leads to an efficient cache-oblivious algorithm.

We describe below the two parts of our algorithm. The pseudocodes for both parts are given in Fig. 2. The initial call is to COMPUTE-TRACEBACK-PATH on the two input sequences, which computes the traceback path through the input matrix starting at its bottom-right corner. This function uses the COMPUTE-BOUNDARY routine for decomposing the input matrix into submatrices, each representing a smaller instance of the original problem.

COMPUTE-BOUNDARY. Given the input boundary of c[i1 : i2, j1 : j2], this function (Function 2.2) recursively computes its output boundary. For simplicity of exposition,


Fig. 1. (Cache-oblivious computation of traceback path (by function COMPUTE-TRACEBACK-PATH)). (a) The inputs are the two sequences X and Y, and the entries on the left and the top boundaries of matrix Q ≡ c[1 : n, 1 : n]. (b) Forward pass: The output boundaries of three of the four quadrants of Q are computed recursively (by calling COMPUTE-BOUNDARY) in the following order: Q11, Q12, and Q21. This order ensures that the input boundaries of each quadrant are already available at the time it is processed by COMPUTE-BOUNDARY. (c) Backward pass: The fragments of the traceback path are extracted from the quadrants by calling COMPUTE-TRACEBACK-PATH on them recursively in the opposite order of the forward pass.


we assume that i2 − i1 = j2 − j1 = 2^q − 1 for some integer q ≥ 0.

If q = 0, the function can compute the output boundary directly using recurrence (2.2); otherwise, it decomposes its quadratic computation space Q (initially Q ≡ c[1 : n, 1 : n]) into four quadrants Q_{i,j}, 1 ≤ i, j ≤ 2, where Q_{i,j} denotes the quadrant that is ith from the top and jth from the left. It then computes the output boundary of each quadrant recursively as the input boundary of the quadrant becomes available during the process of computation. After all recursive calls terminate, the output boundary of Q is composed from the output boundaries of the quadrants.
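For concreteness, here is a minimal Python sketch of this recursion instantiated for the LCS recurrence (our illustration, not the authors' pseudocode). It assumes equal power-of-2 lengths and represents each boundary as a list whose entry 0 holds the shared corner value.

```python
def compute_boundary(X, Y, top, left):
    """Given the input boundary of the DP block for X and Y, return its
    output boundary (bottom row, right column) for the LCS recurrence.
    Conventions: top[j] = c[i1-1, j1-1+j] for 0 <= j <= n, and
    left[i] = c[i1-1+i, j1-1] for 0 <= i <= n, with top[0] == left[0]
    being the shared corner entry c[i1-1, j1-1]."""
    n = len(X)                       # assume len(X) == len(Y) == 2^q
    if n == 1:
        if X[0] == Y[0]:
            v = top[0] + 1
        else:
            v = max(top[1], left[1])
        return [left[1], v], [top[1], v]          # (bottom, right)
    h = n // 2
    X1, X2, Y1, Y2 = X[:h], X[h:], Y[:h], Y[h:]
    # Q11: its input boundary comes directly from the block's input.
    b11, r11 = compute_boundary(X1, Y1, top[:h + 1], left[:h + 1])
    # Q12: top from the block's top, left from Q11's right column.
    b12, r12 = compute_boundary(X1, Y2, top[h:], r11)
    # Q21: top from Q11's bottom row, left from the block's left.
    b21, r21 = compute_boundary(X2, Y1, b11, left[h:])
    # Q22: both parts of its input boundary come from sibling quadrants.
    b22, r22 = compute_boundary(X2, Y2, b12, r21)
    # Compose the output boundary of the whole block (shared corners overlap).
    return b21 + b22[1:], r12 + r22[1:]
```

The initial call uses all-zero boundaries, `compute_boundary(X, Y, [0]*(n+1), [0]*(n+1))`; only O(n) boundary values are alive at any time, mirroring the space analysis in the text.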

Analysis. Let I1(n) be the cache complexity of COMPUTE-BOUNDARY on input sequences of length n each. If the sequences are small enough so that the entire input of size O(n) completely fits into the cache, then the only cache misses incurred by the function will be O(1 + n/B), in order to read the input into the cache initially and write the output to the main memory at the end. In this case, the intermediate recursive function calls will incur no additional cache misses since the entire input is already in the cache. However, if the input is too large to fit into the cache, the total number of cache misses incurred by the function is the sum of the cache misses incurred by the recursive function calls, and O(1 + n/B) additional misses incurred while saving/restoring intermediate outputs between recursive function calls. Thus, we have

I1(n) = O(1 + n/B),                if n ≤ αM;
        4 I1(n/2) + O(1 + n/B),    otherwise;

where α is a suitable constant such that computation involving two input sequences of length αM each can be performed completely inside the cache. Solving the recurrence, we obtain I1(n) = O(1 + n/B + n^2/(BM)) for all n. It is straightforward to show that the algorithm runs in O(n^2) time and uses O(n) space, and the cache complexity reduces to O(n^2/(BM)) when the input is too large for the cache (i.e., n = Ω(M)). In contrast, though the standard iterative dynamic programming approach for computing the output boundary has the same time and space complexities (see, e.g., [14] for a standard technique that allows the DP to be implemented in O(n) space), it incurs a factor of M more cache misses.
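To see how the recurrence for I1(n) solves to the claimed bound, one can unroll it (our sketch, not in the paper): with t = log_2(n/(αM)) levels of recursion before subproblems fit in cache,

```latex
I_1(n) \;=\; \sum_{k=0}^{t-1} 4^k \, O\!\left(1 + \frac{n}{2^k B}\right)
       \;+\; 4^t \, O\!\left(1 + \frac{\alpha M}{B}\right)
      \;=\; O\!\left(4^t + \frac{n}{B}\,2^t\right)
       \;+\; O\!\left(\frac{n^2}{(\alpha M)^2}\cdot\frac{M}{B}\right)
      \;=\; O\!\left(\frac{n^2}{M^2} + \frac{n^2}{BM}\right),
```

and since B ≤ M, the n^2/M^2 term is dominated by n^2/(BM); the O(1 + n/B) terms cover the cases where the recursion bottoms out immediately.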


Fig. 2. Cache-oblivious algorithm for evaluating recurrence (2.2) along with the traceback path. For convenience of exposition, we assume that we only need to compute c[1 : n, 1 : n], where n = 2^q for some nonnegative integer q. The initial call to COMPUTE-TRACEBACK-PATH is made with X = x_1 x_2 ... x_n, Y = y_1 y_2 ... y_n, L ≡ c[0, 0 : n], T ≡ c[0 : n, 0], and P = ⟨(n, n)⟩.


COMPUTE-TRACEBACK-PATH. This is the main algorithm, which recursively calls both itself and COMPUTE-BOUNDARY. Given the input boundary of c[i1 : i2, j1 : j2] and the entry point of the traceback path on the output boundary, this function (Function 2.1) recursively computes the entire path. Recall that a traceback path runs backward, that is, it enters the matrix through a point on the output boundary and exits through the input boundary.

If q = 0, the traceback path can be updated directly using recurrence (2.2); otherwise, the function performs two passes: forward and backward. In the forward pass, it computes the output boundaries of all quadrants except Q_{2,2}, as in COMPUTE-BOUNDARY. After this pass, the algorithm knows the input boundaries of all four quadrants, and the problem reduces to recursively extracting the fragments of the traceback path from each quadrant and combining them. In the backward pass, the algorithm starts at Q_{2,2} and updates the traceback path by calling itself recursively on the quadrants in the reverse order of the forward pass. This backward order of the recursive calls is essential: in order to find the traceback path through a quadrant, the algorithm requires an entry point on its output boundary through which the path enters the quadrant, and initially this point is known for only one quadrant. Processing the quadrants in the backward order ensures that the exit point of the traceback path from one quadrant can be used as the entry point of the path to the next quadrant in the sequence.

Analysis. Let I2(n) be the cache complexity of COMPUTE-TRACEBACK-PATH on input sequences of length n each. We observe that though the algorithm calls itself recursively four times in the backward pass, at most three of those recursive calls will actually be executed and the rest will terminate at line 1 of the algorithm (see Fig. 2), since the traceback path cannot intersect more than three quadrants. Then, using arguments similar to those used in determining I1(n), we have

I2(n) = O(1 + n/B),                            if n ≤ γM;
        3 I2(n/2) + 3 I1(n/2) + O(1 + n/B),    otherwise;

where γ is a suitable constant such that the computation involving sequences of length γM each can be performed completely inside the cache. Solving the recurrence, we obtain I2(n) = O(1 + n/B + n^2/(BM)) for all n. The algorithm runs in O(n^2) time and uses O(n) space, and I2(n) reduces to O(n^2/(BM)) provided the inputs are too large to fit into the cache. When compared with the cache complexity of any existing algorithm for finding the traceback path, our algorithm improves it by at least a factor of M, and improves the space complexity by a factor of n when compared against the standard dynamic programming solution.

Our algorithm can be easily extended to handle lengths that are not powers of 2 within the same performance bounds. Thus, we have the following theorem.

Theorem 2.1. Given two sequences X and Y of length n each, recurrence (2.2) can be solved and a traceback path can be computed cache-obliviously in O(n^2) time, O(n) space, and O(1 + n/B + n^2/(BM)) cache misses.

If the sequences are long enough (i.e., n = Ω(M)), the cache complexity of the algorithm reduces to O(n^2/(BM)).

2.2 A General Framework for DPs with Local Dependencies

Suppose we are given the following:

. d ≥ 2 sequences S_i = s_{i,1} s_{i,2} ... s_{i,n}, 1 ≤ i ≤ d, of length n each, with symbols chosen from an arbitrary finite alphabet Σ. We define the following (to be used later):

– Given integers i_j ∈ [0, n], j ∈ [1, d], we denote by i the sequence of d integers i_1, i_2, ..., i_d; and by ⟨i⟩ we denote the d-dimensional vector ⟨i_1, i_2, ..., i_d⟩.

– By ⟨S_i⟩ we denote the d-dimensional vector ⟨s_{1,i_1}, s_{2,i_2}, ..., s_{d,i_d}⟩ containing the i_j-th symbol of S_j in the jth position, where each i_j ∈ [1, n].

. An arbitrary set U.

. An initialization function h(·) that accepts a vector ⟨i⟩ as input and outputs an element of U.

. A function f(·, ·, ·) that accepts vectors ⟨i⟩ and ⟨S_i⟩, and an ordered set of 2^d − 1 elements of U, and returns an element of U.

Now, suppose c[0 : n, 0 : n, ..., 0 : n] is a d-dimensional matrix that can store elements from the given set U, and we want to compute the entries of c using the following dynamic programming recurrence:

c[i] = h(⟨i⟩),                                                                  if ∃ i_j = 0;     (2.3)
       f(⟨i⟩, ⟨S_i⟩, c[i_1−1 : i_1, i_2−1 : i_2, ..., i_d−1 : i_d] \ c[i]),     otherwise.

Function f can be arbitrary except that it is allowed to use exactly one cell from its third argument to compute the final value of c[i_1, i_2, ..., i_d] (though it can consider all cells), which we call the parent cell of c[i_1, i_2, ..., i_d]. We also assume that f does not access any memory locations in addition to those passed to it as inputs, except possibly some constant-size local variables.

Typically, two types of outputs are expected when evaluating this recurrence: 1) the value of c[n, n, ..., n], and 2) the traceback path starting from c[n, n, ..., n]. As in the case of the LCS recurrence, the traceback path from any cell c[i_1, i_2, ..., i_d] is the path following the chain of parent cells through c that ends at some c[i'_1, i'_2, ..., i'_d] with ∃ i'_j = 0.

Each cell of c can have multiple fields, and in that case f must compute a value for each field, though as before, it is allowed to use exactly one field from its third argument to compute the final value of any field in c[i_1, i_2, ..., i_d]. The definition of traceback path extends naturally to this case, i.e., when the cells have multiple fields.

Recurrence (2.2) in Section 2.1 gives the 2D version of recurrence (2.3), and the h and f functions for the LCS recurrence are listed below that recurrence.



Recurrence (2.3) can be solved in O(n^d) time, O(n^{d−1}) space, and O(n^d/B) cache misses using Hirschberg's technique [23].

A straightforward extension of the cache-oblivious algorithm given in Section 2.1 and Fig. 2 for solving the 2D recurrence (2.2) solves the general recurrence (2.3) along with a traceback path for any arbitrary dimension d ≥ 2. The algorithm is similar to the 2D algorithm, but the computation space is a d-dimensional hypercube, and the input and output boundaries are of dimension d − 1. The algorithm works by decomposing the hypercube into 2^d subhypercubes, and computing the output boundaries of the subhypercubes recursively in a sequence so that the output boundaries of a subhypercube are computed only after its input boundaries become available (possibly as outputs of recursive calls earlier in the sequence). After the output boundaries of all subhypercubes are computed, we can find the traceback path through the entire hypercube by recursively extracting the fragments of the path through the subhypercubes and stitching them together. Thus, we have the following theorem.

Theorem 2.2. Given d ≥ 2 sequences S_i, 1 ≤ i ≤ d, of length n each, with symbols chosen from an arbitrary finite alphabet, recurrence (2.3) can be solved and a traceback path can be computed cache-obliviously in O(n^d) time, O(n^{d−1}) space, and O(n^d/(B M^{1/(d−1)})) cache misses, provided n = Ω(M^{1/(d−1)}) and M = Ω(B^{d−1}).

Details of the cache-oblivious algorithm for d = 3, as well as its pseudocode, can be found in [9] and in the PhD thesis of the first author [8].

2.3 I/O Lower Bound

The following theorem establishes that our cache-oblivious algorithm for solving recurrence (2.3) is cache-optimal.

Theorem 2.3. For any d ≥ 2, any algorithm that implements the computation defined by recurrence (2.3) must perform Ω(n^d/(B M^{1/(d−1)})) block transfers.

We obtain the lower bound in Theorem 2.3 using the I/O lower bound proved by Hong and Kung [25] for executing the DAG obtained by taking the product of d directed line graphs. Let L1 = (V, E) be a directed line graph, where V = {1, 2, ..., n} and E = {(i, i+1) | i ∈ [1, n−1]}. Nodes in L1 represent operations, and edges represent data flow. The node with no incoming edges (i.e., node 1) is the unique input, and the node with no outgoing edges (i.e., node n) is the unique output. For d ≥ 2, L_d is obtained by taking the product of d such L1's. Fig. 3b shows L2. Corollary 7.1 in [25] gives the following lower bound on the number of I/O operations Q required to execute L_d.

Corollary 7.1 in [25]. For the product L_d with d ≥ 2, Q = Ω(n^d/M^{1/(d−1)}).

The corollary above assumes that data is transferred to and from the cache in blocks of size 1. For block size B, Q = Ω(n^d/(B M^{1/(d−1)})).

Now, consider the computation DAG G_d given by recurrence (2.3) for dimension d. Fig. 3a shows this DAG for d = 2. It is easy to see that L_d is, in fact, a sub-DAG of G_d, and hence the I/O lower bound for executing L_d also holds for G_d. Therefore, Theorem 2.3 follows from the corollary above under the assumption that data is transferred in blocks of size B.

2.4 Applications of the Cache-Oblivious Framework

In this section, we apply the cache-oblivious framework described in Section 2.2 to obtain cache-oblivious algorithms for pairwise sequence alignment, median of three sequences, and RNA secondary structure prediction with simple pseudoknots.

2.4.1 Pairwise Global Sequence Alignment with Affine Gap Penalty

Sequence alignment plays a central role in biological sequence comparison, and can reveal important relationships among organisms. Given two strings X = x_1 x_2 ... x_m and Y = y_1 y_2 ... y_n over a finite alphabet Σ, an alignment of X and Y is a matching M of sets {1, 2, ..., m} and {1, 2, ..., n} such that if (i, j), (i', j') ∈ M and i < i' hold, then j < j' must also hold [26]. The ith letter of X or Y is said


Fig. 3. (I/O lower bound for DP implementing recurrence (2.3)). (a) Computational DAG G2 implementing recurrence (2.3) for d = 2. The nodes colored white represent input nodes. (b) Product graph L2 of two line graphs (L1), which is a sub-DAG of DAG G2 shown in (a). Hence, the I/O lower bound for executing L2 also holds for G2.


to be in a gap if it does not appear in any pair in M. Given a gap penalty g and a mismatch cost s(a, b) for each pair a, b ∈ Σ, the basic (global) pairwise sequence alignment problem asks for a matching M_opt for which (m + n − 2|M_opt|) · g + Σ_{(a,b)∈M_opt} s(a, b) is minimized [26].

For simplicity of exposition, we will assume m = n for the rest of this section.

The formulation of the basic sequence alignment problem favors a large number of small gaps, while real biological processes favor the opposite. The alignment can be made more realistic by using an affine gap penalty [20], [4], which has two parameters: a gap introduction cost g_o and a gap extension cost g_e. A run of k gaps incurs a total cost of g_o + g_e · k.

In [20], Gotoh presented an O(n^2)-time and O(n^2)-space DP algorithm for solving the global pairwise alignment problem with affine gap costs. The algorithm incurs O(n^2/B) cache misses. The space complexity of the algorithm can be reduced to O(n) using Hirschberg's space-reduction technique [34] or the diagonal checkpointing technique described in [21]. Gotoh's algorithm solves the following DP recurrences:

D(i, j) = G(0, j) + g_e,                                 if i = 0 and j > 0;     (2.4)
          min{D(i−1, j), G(i−1, j) + g_o} + g_e,         if i > 0 and j > 0;

I(i, j) = G(i, 0) + g_e,                                 if i > 0 and j = 0;     (2.5)
          min{I(i, j−1), G(i, j−1) + g_o} + g_e,         if i > 0 and j > 0;

G(i, j) = 0,                                             if i = 0 and j = 0;
          g_o + g_e · j,                                 if i = 0 and j > 0;     (2.6)
          g_o + g_e · i,                                 if i > 0 and j = 0;
          min{D(i, j), I(i, j), G(i−1, j−1) + s(x_i, y_j)},   if i > 0 and j > 0.

The optimal alignment cost is min{G(n, n), D(n, n), I(n, n)}, and an optimal alignment can be traced back from the smallest of G(n, n), D(n, n), and I(n, n).

Cache-oblivious implementation. Recurrences (2.4)-(2.6) can be viewed as evaluating a single matrix c[0 : n, 0 : n] with three fields: D, I, and G. These recurrences can be treated as a single recurrence matching the general recurrence (2.3) for d = 2 by defining functions h and f to output triplets with their first, second, and third entries containing values for the D, I, and G fields of c, respectively. For example, f(⟨i, j⟩, ⟨x_i, y_j⟩, c[i−1 : i, j−1 : j] \ c[i, j]) returns a triplet ⟨v_D, v_I, v_G⟩, where

v_D = min{c[i−1, j].D, c[i−1, j].G + g_o} + g_e,
v_I = min{c[i, j−1].I, c[i, j−1].G + g_o} + g_e,
v_G = min{c[i, j].D, c[i, j].I, c[i−1, j−1].G + s(x_i, y_j)}.

Therefore, function COMPUTE-BOUNDARY in Fig. 2 can be used to compute the optimal alignment cost cache-obliviously, and COMPUTE-TRACEBACK-PATH can be used to extract the optimal alignment. Thus, the following claim follows from Theorem 2.1 in Section 2.1.
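As a reference point for recurrences (2.4)-(2.6), here is a direct O(n^2)-space iterative evaluation in Python (our illustration of Gotoh's recurrences, not the cache-oblivious algorithm; the mismatch cost and gap parameters in the usage note are hypothetical example values).

```python
def gotoh_cost(X, Y, s, g_o, g_e):
    """Iteratively evaluate recurrences (2.4)-(2.6) and return the optimal
    global alignment cost, where a run of k gaps costs g_o + g_e * k."""
    n, m = len(X), len(Y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    I = [[INF] * (m + 1) for _ in range(n + 1)]
    G = [[INF] * (m + 1) for _ in range(n + 1)]
    G[0][0] = 0
    for j in range(1, m + 1):
        G[0][j] = g_o + g_e * j          # gap run along the top row
        D[0][j] = G[0][j] + g_e          # boundary case of (2.4)
    for i in range(1, n + 1):
        G[i][0] = g_o + g_e * i          # gap run along the left column
        I[i][0] = G[i][0] + g_e          # boundary case of (2.5)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j], G[i - 1][j] + g_o) + g_e
            I[i][j] = min(I[i][j - 1], G[i][j - 1] + g_o) + g_e
            G[i][j] = min(D[i][j], I[i][j],
                          G[i - 1][j - 1] + s(X[i - 1], Y[j - 1]))
    return min(G[n][m], D[n][m], I[n][m])
```

For instance, with a unit mismatch cost (s(a, b) = 0 if a == b else 1), g_o = 2, and g_e = 1, aligning "AC" against "C" costs 3 (one length-1 gap plus a free match).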

Claim 2.1. Optimal global alignment of two sequences of length n each can be performed cache-obliviously using an affine gap cost in O(n^2) time, O(n) space, and O(n^2/(BM)) cache misses.

The cache complexity of our cache-oblivious algorithm is a factor of M improvement over previous algorithms [20], [34].
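For concreteness, recurrences (2.4)-(2.6) can be evaluated with a straightforward quadratic-time, quadratic-space iterative kernel. The sketch below is only an illustration of the recurrences themselves, not the cache-oblivious or linear-space algorithm; the default gap and mismatch costs are arbitrary choices made for the example.

```cpp
#include <algorithm>
#include <limits>
#include <string>
#include <vector>

// Illustrative evaluation of Gotoh's recurrences (2.4)-(2.6).
// D: alignment ends with a gap in y; I: ends with a gap in x;
// G: ends with a match/mismatch. go = gap-open cost, ge = gap-extend cost.
int gotoh_cost(const std::string& x, const std::string& y,
               int go = 3, int ge = 1, int mismatch = 1) {
    const int n = (int)x.size(), m = (int)y.size();
    const int INF = std::numeric_limits<int>::max() / 2;  // avoids overflow on +
    std::vector<std::vector<int>> D(n + 1, std::vector<int>(m + 1, INF)),
                                  I(n + 1, std::vector<int>(m + 1, INF)),
                                  G(n + 1, std::vector<int>(m + 1, INF));
    G[0][0] = 0;
    for (int j = 1; j <= m; ++j) { G[0][j] = go + ge * j; D[0][j] = G[0][j] + ge; }
    for (int i = 1; i <= n; ++i) { G[i][0] = go + ge * i; I[i][0] = G[i][0] + ge; }
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j) {
            D[i][j] = std::min(D[i - 1][j], G[i - 1][j] + go) + ge;      // (2.4)
            I[i][j] = std::min(I[i][j - 1], G[i][j - 1] + go) + ge;      // (2.5)
            int s = (x[i - 1] == y[j - 1]) ? 0 : mismatch;               // s(x_i, y_j)
            G[i][j] = std::min({D[i][j], I[i][j], G[i - 1][j - 1] + s}); // (2.6)
        }
    // Optimal alignment cost: min{ G(n,m), D(n,m), I(n,m) }.
    return std::min({G[n][m], D[n][m], I[n][m]});
}
```

This kernel uses O(nm) space and incurs O(nm/B) cache misses; the point of the cache-oblivious algorithm is precisely to do better than this naive traversal.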

2.4.2 Alignment of Three Sequences (Median)

Given three sequences X, Y, and Z, the median problem asks for a sequence W such that the sum of the pairwise alignment costs of W with X, Y, and Z is minimized. The sequence W is called the median of the three given sequences. In this section, we will assume affine gap costs for the alignments. The three-way sequence alignment can be obtained as the pairwise alignment of each of X, Y, and Z with the median sequence W. Hence, we will focus on finding the median sequence.

In [28], Knudsen presented a dynamic programming algorithm for optimal multiple alignment of any number of sequences related by a tree under affine gap costs. The input sequences are assumed to be at the leaves of the tree, and the optimal alignment cost is the minimum sum of pairwise alignment costs of the sequence pairs at the ends of each edge of the tree over all possible ancestral sequences (i.e., the unknown sequences at the internal nodes of the tree). For N sequences of length n each, the algorithm runs in O(16.81^N n^N) time and uses O(7.442^N n^N) space. For N = 3, Knudsen's algorithm solves the median problem in O(n^3) time and space, and incurs O(n^3/B) cache misses. An Ukkonen-based algorithm for the median problem is presented in [38], which performs well especially for sequences whose (three-way) edit distance δ is small. This method performs DP on the edit distance instead of the sequence lengths. On average, it requires O(n + δ^3) time and space. A space-reduced version of the algorithm uses O(n + δ^2) space, but runs in O(n log δ + δ^3) time on average [38].

Knudsen's algorithm [28] for three sequences (say, X = x_1 x_2 ... x_n, Y = y_1 y_2 ... y_n, and Z = z_1 z_2 ... z_n) is a dynamic program over a 3D matrix K. The three sequences are assumed to be at the leaves of a star-shaped tree, the root of which corresponds to the ancestor/median sequence W. Each entry K(i,j,k) is composed of several fields, each of which corresponds to an indel configuration that keeps track of ongoing insertions/deletions in the alignment. In a multiple alignment, if we compare the symbols x and w at location l of X and W, respectively, they will be in one of the following three states: 1) x is a residue and w is a gap (i.e., an insertion into X), 2) x is a gap and w is a residue (i.e., a deletion from X), and 3) either both of them are residues or both are gaps. Thus, if all three input sequences are considered, we have a total of 3^3 = 27 such possibilities, each of which is called an indel

CHOWDHURY ET AL.: CACHE-OBLIVIOUS DYNAMIC PROGRAMMING FOR BIOINFORMATICS 501


configuration. However, 4 of those 27 configurations are considered invalid since they lead to contradictory state assignments for the sequences. Hence, K(i,j,k) consists of only 23 fields. A residue configuration defines how the next three symbols of the sequences will be matched. Each configuration is a vector e = (e_1, e_2, e_3, e_4), where e_i ∈ {0, 1}, 1 ≤ i ≤ 4. The entry e_i, 1 ≤ i ≤ 3, is 1 if the aligned symbol of sequence i is a residue, and 0 otherwise. The last entry e_4 corresponds to the aligned symbol of the median sequence. A residue configuration is valid provided at least one of e_1, e_2, and e_3 is 1, and if more than one of them is 1 then e_4 is also 1 (see [28] for the reasoning behind these conditions). Thus, only 10 of the 2^4 = 16 possible residue configurations are valid. We define ψ(e, q') = q if applying the residue configuration e to the indel configuration q' leads to the indel configuration q. Knudsen's algorithm uses the following recurrence relation, which for any indel configuration q computes the field K(i,j,k)_q from all fields K(i',j',k')_{q'} such that for some residue configuration e, ψ(e, q') = q, i' = i - e_1, j' = j - e_2, and k' = k - e_3:

K(i,j,k)_q =
  0,   if i = j = k = 0 ∧ q = q_o;
  ∞,   if i = j = k = 0 ∧ q ≠ q_o;
  min_{e, q' : q = ψ(e, q')} { K(i',j',k')_{q'} + G_{e,q'} + M_{(i',j',k')→(i,j,k)} },   otherwise;     (2.7)

where q_o is the indel configuration in which all symbols match, M_{(i',j',k')→(i,j,k)} is the matching cost when going from alignment (x_{i'}, y_{j'}, z_{k'}) to alignment (x_i, y_j, z_k), and G_{e,q'} is the gap cost of applying e to q'. The M and G matrices can be precomputed. Therefore, Knudsen's algorithm runs in O(n^3) time and space with O(n^3/B) cache misses.

Cache-oblivious algorithm. In order to make recurrence (2.7) match the general recurrence (2.3) for d = 3 given in Section 2, we shift all symbols of X, Y, and Z one position to the right, introduce a dummy symbol in front of each of those three sequences, and obtain the following recurrence by modifying recurrence (2.7), where c[i,j,k]_q = K(i-1, j-1, k-1)_q for 1 ≤ i, j, k ≤ n+1 and any q:

c[i,j,k]_q =
  ∞,   if i = 0 ∨ j = 0 ∨ k = 0;
  0,   if i = j = k = 1 ∧ q = q_o;
  ∞,   if i = j = k = 1 ∧ q ≠ q_o;
  min_{e, q' : q = ψ(e, q')} { c[i',j',k']_{q'} + G_{e,q'} + M_{(i',j',k')→(i,j,k)} },   otherwise.

If i = 0 or j = 0 or k = 0, then c[i,j,k]_q can be evaluated using a function h(⟨i,j,k⟩) = ∞ as in the general recurrence (2.3). Otherwise, the value of c[i,j,k]_q depends on the values of i, j, and k, on values in some constant-size arrays (G and M), and on the cells to its left, back, and top. Hence, in this case, c[i,j,k]_q can be evaluated using a function similar to f in recurrence (2.3) for d = 3. This function is defined simply by the last three rows of the recurrence for c[i,j,k]_q. Therefore, the above recurrence matches the 3D version of the general recurrence (2.3), and we claim the following using Theorem 2.2.

Claim 2.2. Optimal alignment of three sequences of length n each can be performed, and the median sequence under the optimal alignment computed, cache-obliviously using an affine gap cost in O(n^3) time, O(n^2) space, and O(n^3/(B√M)) cache misses.
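The counting argument for residue configurations is easy to check mechanically. The sketch below enumerates all 2^4 = 16 vectors (e_1, e_2, e_3, e_4) and filters them by the validity conditions stated earlier in this section; it is an illustration only, not part of the median algorithm itself.

```cpp
#include <array>
#include <vector>

// A residue configuration e = (e1,e2,e3,e4) is valid iff at least one of
// e1,e2,e3 is 1, and if more than one of them is 1 then e4 must also be 1.
bool valid_residue_config(const std::array<int, 4>& e) {
    int residues = e[0] + e[1] + e[2];
    if (residues == 0) return false;            // no input sequence advances
    if (residues > 1 && e[3] != 1) return false; // multi-way match needs a median residue
    return true;
}

// Enumerate all 16 candidate vectors and keep the valid ones.
std::vector<std::array<int, 4>> valid_residue_configs() {
    std::vector<std::array<int, 4>> out;
    for (int mask = 0; mask < 16; ++mask) {
        std::array<int, 4> e = {mask & 1, (mask >> 1) & 1,
                                (mask >> 2) & 1, (mask >> 3) & 1};
        if (valid_residue_config(e)) out.push_back(e);
    }
    return out;
}
```

The enumeration confirms the count used above: 3 · 2 = 6 configurations with exactly one residue, 3 with two residues (e_4 forced to 1), and 1 with three, for a total of 10.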

2.4.3 RNA Secondary Structure Prediction with Simple Pseudoknots

A single-stranded RNA can be viewed as a string X = x_1 x_2 ... x_n over the alphabet {A, U, G, C} of bases. An RNA strand tends to give rise to interesting structures by forming complementary base pairs with itself. An RNA secondary structure (without pseudoknots) is a planar graph satisfying the nesting condition: if {x_i, x_j} and {x_k, x_l} form base pairs and i < j, k < l, and i < k hold, then either i < k < l < j or i < j < k < l [42], [39], [3]. An RNA secondary structure with pseudoknots is a structure in which this nesting condition is violated [39], [3].
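The nesting condition can be tested directly on a candidate set of base pairs. The sketch below flags a pseudoknot whenever two pairs cross; representing a structure as a list of (i, j) index pairs with i < j is an assumption made for this example.

```cpp
#include <utility>
#include <vector>

// Nesting condition: for base pairs (i,j) and (k,l) with i<j, k<l, and i<k,
// we must have i<k<l<j (nested) or i<j<k<l (disjoint); otherwise the two
// pairs cross, i.e., the structure contains a pseudoknot.
bool has_pseudoknot(std::vector<std::pair<int, int>> pairs) {
    for (std::size_t a = 0; a < pairs.size(); ++a)
        for (std::size_t b = 0; b < pairs.size(); ++b) {
            int i = pairs[a].first, j = pairs[a].second;
            int k = pairs[b].first, l = pairs[b].second;
            if (i < k) {
                bool nested = (l < j);    // i < k < l < j
                bool disjoint = (j < k);  // i < j < k < l
                if (!nested && !disjoint) return true;  // crossing pair found
            }
        }
    return false;
}
```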

In [3], Akutsu presented a DP to compute RNA secondary structures with the maximum number of base pairs in the presence of simple pseudoknots (see [3] for the definition), which runs in O(n^4) time, O(n^3) space, and O(n^4/B) cache misses. In this section, we improve its space and cache complexities to O(n^2) and O(n^4/(B√M)), respectively, without changing its time complexity.

We list below the DP recurrences used in Akutsu's algorithm [3]. For every pair (i_0, k_0) with 1 ≤ i_0 ≤ k_0 - 2 ≤ n - 2, recurrences (2.8)-(2.12) compute the maximum number of base pairs in a pseudoknot with endpoints at the i_0th and k_0th residues. The value computed by recurrence (2.12), i.e., S_pseudo(i_0, k_0), is the desired value. Recurrences (2.8)-(2.10) consider three locations (i, j, k) (i_0 - 1 ≤ i < j ≤ k ≤ k_0) on the RNA at a time. Recurrences (2.8) and (2.9) correspond to cases where {x_i, x_j} and {x_j, x_k} form base pairs, respectively, while recurrence (2.10) handles the case where neither {x_i, x_j} nor {x_j, x_k} forms a base pair. The variable S_MAX(i, j, k) in recurrence (2.11) contains the maximum score for the triple (i, j, k). In recurrences (2.8) and (2.9), v(x_s, y_t) = 1 if {x_s, y_t} form a base pair; otherwise, v(x_s, y_t) = -∞. All uninitialized entries are assumed to have value 0:

S_L(i,j,k) =
  v(x_i, x_j),                          if i_0 ≤ i < j ≤ k;
  v(x_i, x_j) + S_MAX(i-1, j+1, k),     if i_0 ≤ i < j < k.     (2.8)

S_R(i,j,k) =
  v(x_j, x_k),                          if i_0 - 1 = i < j - 1 = k - 2;
  v(x_j, x_k) + S_MAX(i, j+1, k-1),     if i_0 ≤ i < j < k.     (2.9)

S_M(i,j,k) = max{ S_L(i-1, j, k), S_M(i-1, j, k), S_MAX(i, j+1, k), S_M(i, j, k-1), S_R(i, j, k-1) },  if i_0 ≤ i < j < k.     (2.10)

502 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 7, NO. 3, JULY-SEPTEMBER 2010


S_MAX(i,j,k) = max{ S_L(i,j,k), S_M(i,j,k), S_R(i,j,k) };     (2.11)

S_pseudo(i_0, k_0) = max_{i_0 ≤ i < j < k ≤ k_0} { S_MAX(i,j,k) }.     (2.12)

After computing all entries of S_MAX for a fixed i_0, all S_pseudo(i_0, k_0) values for k_0 ≥ i_0 + 2 can be computed using (2.12) in O(n^3) time and space and O(n^3/B) cache misses. Since there are n - 2 possible values for i_0, all S_pseudo(i_0, k_0) can be computed in O(n^4) time, O(n^3) space, and O(n^4/B) cache misses.

Finally, the following recurrence computes the optimal score S(1,n) for the entire structure:

S(i,j) = max{ S_pseudo(i,j), S(i+1, j-1) + v(x_i, x_j), max_{i < k ≤ j} { S(i, k-1) + S(k, j) } }.     (2.13)
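Recurrence (2.13) has the familiar shape of base-pair-maximization recurrences and can be evaluated iteratively over increasing span lengths. The sketch below illustrates only this evaluation order, with the S_pseudo term stubbed out to 0 (i.e., pseudoknots ignored); the pairing score v here returns +1 for a Watson-Crick pair and -1 otherwise as a stand-in for -∞, both assumptions made for the example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Stand-in pairing score: +1 for a Watson-Crick pair, -1 otherwise.
int pair_score(char a, char b) {
    auto wc = [](char p, char q) {
        return (p == 'A' && q == 'U') || (p == 'U' && q == 'A') ||
               (p == 'G' && q == 'C') || (p == 'C' && q == 'G');
    };
    return wc(a, b) ? 1 : -1;
}

// Iterative evaluation of the shape of recurrence (2.13), S_pseudo stubbed to 0:
// S(i,j) = max{ 0, S(i+1,j-1) + v(x_i,x_j), max_{i<k<=j} { S(i,k-1) + S(k,j) } }.
int structure_score(const std::string& x) {
    int n = (int)x.size();
    std::vector<std::vector<int>> S(n + 2, std::vector<int>(n + 1, 0));
    for (int len = 2; len <= n; ++len)            // shorter spans before longer ones
        for (int i = 1; i + len - 1 <= n; ++i) {
            int j = i + len - 1;
            int best = 0;                          // S_pseudo(i,j) stub
            best = std::max(best, S[i + 1][j - 1] + pair_score(x[i - 1], x[j - 1]));
            for (int k = i + 1; k <= j; ++k)       // bifurcation term
                best = std::max(best, S[i][k - 1] + S[k][j]);
            S[i][j] = best;
        }
    return S[1][n];
}
```

The doubly nested loop over spans plus the inner split point k gives the O(n^3) time and O(n^2) space of the iterative evaluation discussed next.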

Iterative evaluation of recurrence (2.13) requires O(n^3) time and O(n^2) space, and incurs O(n^3/B) cache misses [3], which is sufficient for our purposes. This recurrence can be evaluated in only O(n^3/(B√M)) cache misses, without changing the other bounds, using the cache-oblivious Gaussian Elimination Paradigm (GEP) framework we presented in [10] and [11].

Space reduction. We now describe our space-reduction result. A similar method was suggested in [31]. Observe that evaluating recurrence (2.12) requires retaining all O(n^3) values computed by recurrence (2.11). We avoid using this extra space by computing all required S_pseudo(i_0, k_0) values on the fly while evaluating recurrence (2.11). We achieve this goal by introducing recurrence (2.14), replacing recurrence (2.12) with recurrence (2.15) for S'_pseudo, and using S'_pseudo instead of S_pseudo when evaluating recurrence (2.13). Intuitively, the variable S_P(i,j,k) in recurrence (2.14) stores the maximum score among all triples (i, j', k) with j' ≥ j. All uninitialized entries in recurrences (2.14) and (2.15) are assumed to have value -∞:

S_P(i,j,k) =
  max{ S_MAX(i,j,k), S_P(i, j+1, k) },   if i_0 ≤ i < j < k;
  S_P(i, j+1, k),                        if i_0 ≤ j ≤ i < k.     (2.14)

S'_pseudo(i_0, k_0) = max{ S'_pseudo(i_0, k_0 - 1), max_{i_0 ≤ i < k_0 - 1} { S_P(i, i_0 + 1, k_0) } },  if k_0 ≥ i_0 + 2.     (2.15)

We claim that recurrence (2.15) computes exactly the same values as recurrence (2.12).

Claim 2.3. For 1 ≤ i_0 ≤ k_0 - 2 ≤ n - 2, S'_pseudo(i_0, k_0) = S_pseudo(i_0, k_0).

Proof. (sketch) We obtain the following by simplifying recurrence (2.14):

S_P(i,j,k) =
  max_{max{i+1, j} ≤ j' < k} { S_MAX(i, j', k) },   if i_0 ≤ i ∧ j < k;
  -∞,                                               otherwise.

Therefore,

max_{i_0 ≤ i < k_0 - 1} { S_P(i, i_0 + 1, k_0) } = max_{i_0 ≤ i < j < k_0} { S_MAX(i, j, k_0) }.

We can now evaluate S'_pseudo(i_0, k_0) by induction on k_0. For k_0 ≥ i_0 + 2,

S'_pseudo(i_0, k_0) = max{ S'_pseudo(i_0, k_0 - 1), max_{i_0 ≤ i < k_0 - 1} { S_P(i, i_0 + 1, k_0) } }
                    = max{ max_{i_0 ≤ i < j < k ≤ k_0 - 1} { S_MAX(i, j, k) }, max_{i_0 ≤ i < j < k_0} { S_MAX(i, j, k_0) } }
                    = max_{i_0 ≤ i < j < k ≤ k_0} { S_MAX(i, j, k) }
                    = S_pseudo(i_0, k_0).   □

Now, observe that in order to evaluate recurrence (2.15), we only need the values S_P(i, j, k) for j = i_0 + 1, and each entry (i, j, k) in recurrences (2.8)-(2.11) and (2.14) depends only on entries (·, j, ·) and (·, j+1, ·). Therefore, we evaluate the recurrences for j = n first, then for j = n - 1, and continue down to j = i_0 + 1. Observe that in order to evaluate for j = j_0, we only need to retain the O(n^2) entries computed for j = j_0 + 1. Thus, for a fixed i_0, all S_P(i, i_0 + 1, k) and consequently all relevant S'_pseudo(i_0, k_0) values can be computed using only O(n^2) space, and the same space can be reused for all n values of i_0.
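This two-plane evaluation order generalizes beyond these particular recurrences: any 3D table in which plane j depends only on planes j and j+1 can be swept with O(n^2) space. The sketch below demonstrates the rolling-plane control structure on a made-up stand-in recurrence, not on recurrences (2.8)-(2.11) and (2.14) themselves.

```cpp
#include <algorithm>
#include <vector>

// Rolling-plane demo: evaluate c[i][j][k] = c[i-1][j][k] + c[i][j+1][k] + 1
// (a stand-in recurrence with the same dependency shape: plane j depends only
// on planes j and j+1), keeping just two n x n planes in memory.
// Boundaries: c[0][*][*] = 0 and c[*][n+1][*] = 0. Returns c[n][1][n].
long long rolling_plane_demo(int n) {
    std::vector<std::vector<long long>> cur(n + 2, std::vector<long long>(n + 2, 0)),
                                        nxt(n + 2, std::vector<long long>(n + 2, 0));
    for (int j = n; j >= 1; --j) {          // sweep j from n down to 1
        for (int i = 1; i <= n; ++i)
            for (int k = 1; k <= n; ++k)
                cur[i][k] = cur[i - 1][k] + nxt[i][k] + 1;  // reads planes j and j+1 only
        std::swap(cur, nxt);                // plane j becomes "j+1" for the next sweep
    }
    return nxt[n][n];
}
```

Every cell of `cur` is overwritten before it is read in a sweep (row 0 stays at the boundary value), so the O(n^3) table never needs to be materialized, mirroring the space reduction above.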

Cache-oblivious algorithm. The evaluation of recurrences (2.8)-(2.11) and (2.14) can be viewed as evaluating a single n × n × n matrix c with five fields: S_L, S_R, S_M, S_MAX, and S_P. If we replace all j with n - j + 1 in the resulting recurrence, it conforms to recurrence (2.3) for d = 3. Therefore, for any fixed i_0, we can use our cache-oblivious algorithm to compute all entries S_P(i, i_0 + 1, k) and consequently all relevant S'_pseudo(i_0, k_0) values in O(n^3) time, O(n^2) space, and O(n^3/(B√M)) cache misses. All S'_pseudo(i_0, k_0) values can be computed by n applications (i.e., once for each i_0) of the algorithm.

For any given pair (i_0, k_0), the pseudoknot with the optimal score can be traced back cache-obliviously in O(n^3) time, O(n^2) space, and O(n^3/(B√M)) cache misses using our algorithm. Thus, from Theorem 2.2, we obtain the following claim.

Claim 2.4. Given an RNA sequence of length n, a secondary structure that has the maximum number of base pairs in the presence of simple pseudoknots can be computed cache-obliviously in O(n^4) time, O(n^2) space, and O(n^4/(B√M)) cache misses.

Extensions. In [3], the basic dynamic program for simple pseudoknots has been extended to handle energy functions based on adjacent base pairs within the same time and space bounds. Our cache-oblivious technique as described above can be adapted to solve this extension within the same improved bounds as for the basic DP. An O(n^(4-ε))-time approximation algorithm for the basic DP has also been proposed in [3], and our techniques can be used to improve the space and cache complexity of that algorithm to O(n^2) (from O(n^3)) and O(n^(4-ε)/(B√M)) (from O(n^(4-ε)/B)), respectively.

3 EXPERIMENTAL RESULTS

In this section, we present experimental results for the three bioinformatics applications discussed in Section 2.4. We used the machines listed in Table 1 for our experiments. All machines ran Ubuntu Linux 5.10. All our algorithms were implemented in C++ (compiled with g++ 3.3.4), while some software packages we used for comparison were written in C (compiled with gcc 3.3.4). The -O3 optimization flag was used in all cases. Each machine was used exclusively for experiments, and only one processor was used. The Cachegrind profiler [40] was used for simulating cache effects.

In order to reduce the overhead of recursion in the cache-oblivious algorithms, our implementations did not run the recursion all the way down to sequence length r = 1. Instead, we stopped the recursion at a larger value of r and solved the subproblem at that size using the traditional iterative method. This is a commonly used methodology: if we were to keep the base-case size at 1, the cost of the overheads associated with a recursive call would far exceed the useful computation performed during execution of the base case. Note that this is unrelated to the cache size, except that one would want to use a base-case size that is sufficiently small so that it does not overflow the cache. The values of r that we used were r = 256 for pairwise alignment (PA-CO), and r = 64 for median (MED-CO) and for RNA secondary structure with simple pseudoknots (RNA-CO). Details on the iterative methods used to solve the base cases can be found in the theses [8], [30].
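This base-case cutoff is the standard coarsening trick for divide-and-conquer code. The toy routine below (a hypothetical summation, not one of the paper's kernels) shows only the control structure: recurse until the subproblem size drops to r, then switch to an iterative loop.

```cpp
#include <cstddef>
#include <vector>

// Toy hybrid recursion: divide-and-conquer down to subproblems of size at
// most r, then an iterative base case, mirroring how the implementations
// stop their recursion at r = 64 or r = 256 instead of r = 1.
long long hybrid_sum(const std::vector<long long>& a,
                     std::size_t lo, std::size_t hi, std::size_t r) {
    if (hi - lo <= r) {                      // iterative base case
        long long s = 0;
        for (std::size_t i = lo; i < hi; ++i) s += a[i];
        return s;
    }
    std::size_t mid = lo + (hi - lo) / 2;    // otherwise, divide and conquer
    return hybrid_sum(a, lo, mid, r) + hybrid_sum(a, mid, hi, r);
}
```

The result is independent of r; only the ratio of recursion overhead to useful work changes, which is exactly the knob the experiments tune.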

Our cache-oblivious algorithms outperformed currently available software and methods for all three applications. We describe details of our experimental results below.

3.1 Pairwise Global Sequence Alignment with Affine Gap Penalty

We performed experimental evaluation of the implementations listed in Table 2: PA-CO is our implementation of our linear-space cache-oblivious algorithm given in Section 2.4.1, and PA-FASTA is the implementation of the linear-space version of Gotoh's algorithm [34] available in the fasta2 package [35].


TABLE 1
Machines Used for Experiments; On All Machines Only One Processor Was Used

TABLE 2
Pairwise Sequence Alignment Algorithms Used in Our Experiments

Fig. 4. Pairwise alignment on random sequences: All running times are normalized w.r.t. that of PA-CO. Each data point is the average of three independent runs on randomly generated strings over {A, T, G, C}.


In most cases, for input sequence sizes ranging from 1,000 to over a million, PA-FASTA ran about 20 to 30 percent slower than PA-CO on AMD Opteron and up to 10 percent slower on Intel Xeon. We note that most bioinformatics applications use relatively short sequences of length less than 10,000, and this sequence length falls within the range considered in Fig. 4, though our experiments also considered much longer sequence pairs.

Random sequences. We ran on randomly generated equal-length string-pairs over {A, C, G, T} on AMD Opteron 250

(see Fig. 4a) and Intel P4 Xeon (see Fig. 4b). We varied string

lengths from 1 K to 512 K. In our experiments, PA-FASTA

always ran slower than PA-CO on AMD Opteron (around

27 percent slower for sequences of length 512 K) and

generally the relative performance of PA-CO improved

over PA-FASTA as the length of the sequences increased.

The trend was almost similar on Intel Xeon except that the

improvement of PA-CO over PA-FASTA was more modest.

We also obtained some anomalous results for n around 10,000, which we believe are due to architectural effects (cache misalignment of data in PA-FASTA).

Real-world sequences. We ran PA-CO and PA-FASTA on CFTR DNA sequences of lengths between 1.3 million and 1.8 million [41] on AMD Opteron, and PA-FASTA ran 20 to 30 percent slower than PA-CO on these sequences (see Table 3). Though proper alignment of these genomic sequences requires more sophisticated cost functions, the running times of PA-CO and PA-FASTA on these sequences give us some idea of the relative performance of these implementations on very long sequences.

Cache performance. We measured the number of L1 and L2 cache misses incurred by both PA-FASTA and PA-CO on random sequences (see Fig. 5). Though PA-FASTA causes fewer cache misses than PA-CO when the input fits into the cache, it incurs significantly more misses than PA-CO as the input size grows beyond the cache size. On AMD Opteron, PA-FASTA incurs up to 300 times more L1 misses and 2,500 times more L2 misses than PA-CO, while on Intel Xeon, the figures are 10 and 1,000, respectively. Observe that the larger the cache, the larger the cache-miss ratio, which follows theoretical predictions, since we know that PA-CO should incur fewer cache misses on larger caches while the cache performance of PA-FASTA should be independent of cache size.

3.2 Median of Three Sequences

In this section, we report our experimental results for the median problem (i.e., the problem of determining a three-way global sequence alignment with affine gap penalty). We performed experimental evaluation of the implementations listed in Table 4: MED-CO implements our quadratic-space cache-oblivious median algorithm described in Section 2.4.2, MED-Knudsen is Knudsen's cubic-space median algorithm [28] implemented by Knudsen himself [27], MED-ukk.alloc is the O(n + δ^3)-space (where δ is the three-way edit distance of the sequences) Ukkonen-based median algorithm described in [38], and MED-ukk.checkp is the space-reduced version of MED-ukk.alloc based on the checkpointing technique. Both


TABLE 3
Pairwise Alignment Algorithms on Real Data: Each Entry in Columns 2 and 3 Is the Time for a Single Run

Fig. 5. Ratio of cache misses incurred by PA-FASTA to that incurred by PA-CO (see Table 2) on random sequences for both L1 and L2 caches. Data was obtained using Cachegrind [40].


Ukkonen-based algorithms were implemented by Powell [37]. Finally, MED-H is our quadratic-space implementation of Knudsen's algorithm based on Hirschberg's space-reduction technique, which we consider at the end of this section.

We used a gap insertion cost of 3 (i.e., g_o = 3), a gap extension cost of 1 (i.e., g_e = 1), and a mismatch cost of 1 in all experiments.

We first compare the performance of our cache-oblivious algorithm MED-CO with the three publicly available implementations: MED-Knudsen, MED-ukk.alloc, and MED-ukk.checkp. Overall, MED-Knudsen ran about 1.5-2.5 times slower than MED-CO on both machines, and MED-ukk.alloc and MED-ukk.checkp were even slower. Furthermore, due to their high space overhead, MED-Knudsen, MED-ukk.alloc, and MED-ukk.checkp could not be run on any machine for sequences longer than 640, while MED-CO ran to completion on sequences of length over 1,000. We summarize our results below.

Random sequences. We ran all implementations on random (equal-length) sequences of length 64i, 1 ≤ i ≤ 16, on AMD Opteron (see Fig. 6a) and Intel Xeon (see Fig. 6b). On both machines, MED-CO ran the fastest, while MED-Knudsen, MED-ukk.alloc, and MED-ukk.checkp crashed as they ran out of memory for sequences longer than 384, 256, and 640, respectively.

On Intel Xeon, MED-CO ran at least 1.45 times faster than MED-Knudsen. Both MED-ukk.alloc and MED-ukk.checkp ran at least two times slower than MED-CO for length 64, and continued to slow down even further with increasing sequence length. They ran up to 3.3 times (for length 256) and 4.8 times (for length 640) slower than MED-CO, respectively. The trends were similar on AMD Opteron, where MED-CO ran at least 2.5, 3.4, and 4.2 times faster than MED-Knudsen, MED-ukk.alloc, and MED-ukk.checkp, respectively.

Real-world sequences. We ran the algorithms on triplets of 16S bacterial rDNA sequences from the Pseudanabaena group [16] (see Table 5).

The triplets in Table 5 were formed by choosing at random three sequences of length less than 500 from the group. We report results for the same five triplets on both the Intel Xeon and AMD Opteron 850. The trends were very similar on both machines.

The results show that our cache-oblivious algorithm MED-CO has the best performance: it runs to completion on all triplets and is considerably faster than the other three methods in most cases.

The results also closely track theoretical predictions for all four algorithms. The theoretical analysis shows that the number of steps, space usage, and cache misses for both MED-Knudsen and MED-CO should increase with the product of the lengths of the three sequences, and we note that the overall running time for both of these algorithms increases with this product on both the Intel and AMD machines. However, MED-Knudsen ran around 35 to 50 percent slower than MED-CO on the Xeon and over twice as slow on the Opteron 850.


TABLE 4
Median Algorithms Used in Our Experiments

Fig. 6. Median on random data: Each data point is the average of three independent runs on random strings over {A, T, G, C}.


On the other hand, the resource usage of the two Ukkonen-based methods increases with the alignment cost, and hence these methods are best suited for sequences with small alignment cost. This again shows up in our results: On the Opteron 850, MED-ukk.alloc is actually faster than our cache-oblivious MED-CO on triple 2, which has the smallest alignment cost, but because of its cubic space dependence on the alignment cost, it is unable to run to completion on the other triples. The more space-efficient MED-ukk.checkp runs to completion on all triples, but its performance degrades with the alignment cost, and even for triple 2, it runs about 20 percent slower than MED-CO.

We ran the same triples on Opteron 850 with the mismatch cost set to 2 instead of 1, and we summarize the results here. This increases the alignment cost, and as expected, the two Ukkonen-based methods degraded significantly: MED-ukk.alloc did not complete on any of the five triples, and MED-ukk.checkp ran 5 to 10 times slower than MED-CO. Also as expected, the runtimes for MED-Knudsen and MED-CO were virtually unchanged from the timings in Table 5.

Overall, our experimental results suggest that MED-CO is always a better choice than MED-Knudsen, and a better choice than the two Ukkonen-based algorithms (MED-ukk.alloc and MED-ukk.checkp) except when the alignment cost is very small.

Effects of space reduction and cache-efficiency. In Fig. 7, we plot the running times on triples of random sequences of Knudsen's algorithm (MED-Knudsen), our implementation of a Hirschberg-style space-reduced version of the same algorithm (MED-H), and our space-efficient cache-oblivious algorithm (MED-CO). As the plots show, after simply reducing the space usage from O(n^3) (MED-Knudsen) to O(n^2) (MED-H), the median algorithm runs faster and can


Fig. 7. Improvements in the performance of a median algorithm (MED-Knudsen: Knudsen's implementation of his algorithm [28]) as its space requirement is reduced (with a Hirschberg-style implementation of Knudsen's algorithm (MED-H)), and as both its space usage and cache performance are improved using our cache-oblivious median algorithm (MED-CO). Each data point is the average of three independent runs on random strings over {A, T, G, C}.

TABLE 5
Median on Real Data: The Triplets Were Formed by Choosing Random Sequences of Length Less Than 500

Each number outside parentheses in columns 4-7 is the time for a single run, and the ratio of that running time to the corresponding running time for MED-CO is given within parentheses. A "—" in a column denotes that the corresponding algorithm could not be run due to its high space overhead.


handle much longer sequences. For space-intensive algorithms, reducing space usage can improve cache performance significantly, since the data then fits in lower cache levels and thus incurs fewer cache misses.

Although we were able to improve the performance of Knudsen's algorithm significantly by reducing its space usage to O(n^2), we observe that our cache-oblivious, space-efficient median algorithm has better performance. On AMD Opteron, the running time of Knudsen's algorithm improves by 40 percent after space reduction (MED-H), and our cache-oblivious implementation (MED-CO) gives a further 30 percent improvement. A similar trend is observed on Intel P4 Xeon.

3.3 RNA Secondary Structure Prediction with Pseudoknots

We implemented the algorithms in Table 6 for computing all values of S_pseudo or S'_pseudo (i.e., we compute the optimal scores only; we do not trace back the pseudoknots). We ran all experiments on Intel Xeon using a single processor.

Overall, RNA-QS ran about 50 percent slower than RNA-CO, and RNA-CS ran up to seven times slower than RNA-CO for the sequence lengths it could handle. For sequences longer than 512, RNA-CS could not be run due to lack of memory space. We summarize our results below.

Random sequences. We ran all three algorithms on randomly generated string-pairs over {A, U, G, C} (see Fig. 8). The lengths of the strings were varied from 64 to 2,048. However, due to lack of space, RNA-CS could not be run for strings longer than 512. In our experiments, RNA-CO ran the fastest while RNA-CS was the slowest. Though both RNA-QS and RNA-CS have the same time and cache complexity, RNA-QS ran significantly faster than RNA-CS (e.g., 4.5 times faster for length 512). We believe this happened because even for small sequence lengths RNA-CS overflows the L2 cache, and most of its data reside in the slower RAM, while both RNA-QS and RNA-CO still work completely inside the faster L2 cache. For strings of length 512, RNA-CO ran about 35 percent faster than RNA-QS and about seven times faster than RNA-CS. The performance of RNA-CO improved over that of both RNA-CS and RNA-QS as the length increased.

Real-world sequences. We ran all three implementations on a set of 24 bacterial 5S rRNA sequences obtained from [17]. The average length of the sequences was 118, and the average running times of RNA-CS, RNA-QS, and RNA-CO on each sequence were 1.46, 0.45, and 0.35 seconds, respectively. We also ran RNA-QS and RNA-CO on a set of 10 bacterial (spirochaetes) 16S rRNA sequences of average length 1,509. The RNA-CS implementation could not be run on these sequences due to space limitations. On these


TABLE 6
Algorithms for RNA Secondary Structure Prediction Used in Our Experiments

Fig. 8. RNA secondary structure prediction with simple pseudoknots on random data: All runtimes are normalized w.r.t. RNA-CO. Each data point is the average of three independent runs on random strings over {A, U, G, C}.

TABLE 7
RNA Secondary Structure Prediction with Simple Pseudoknots on Real Data: Inputs Are Bacterial (Spirochaetes) 16S rRNA Sequences [7] with an Average Length of 1,509

Each number in columns 3 and 4 represents the time for a single run.


sequences, RNA-CO took 1 hour 38 minutes, while RNA-QS took 2 hours 38 minutes on average (see Table 7).

4 CONCLUSION

In this paper, we have presented a general cache-oblivious framework for a class of widely encountered dynamic programming problems with local dependencies, and applied it to obtain efficient cache-oblivious algorithms for three important string problems in bioinformatics, namely global pairwise sequence alignment and median (both with affine gap costs), and RNA secondary structure prediction with simple pseudoknots. We have shown that our algorithms are faster, both theoretically and experimentally, than previous algorithms for these problems.

Our framework can be applied to several other dynamic programming problems in bioinformatics, including local alignment, generalized global alignment with intermittent similarities, multiple sequence alignment under several scoring functions such as the "sum-of-pairs" objective function, and RNA secondary structure prediction with simple pseudoknots using energy functions based on adjacent base pairs.

ACKNOWLEDGMENTS

This work was supported in part by US National Science Foundation (NSF) Grant CCF-0514876 and NSF CISE Research Infrastructure Grant EIA-0303609. The authors thank Mike Brudno for the CFTR DNA sequences, Robin Gutell for the rRNA sequences, and David Zhao for the MED-Knudsen, MED-ukk.alloc, and MED-ukk.checkp code. Preliminary versions of Sections 2.1 and 2.3 were presented at SODA '06.

REFERENCES

[1] A. Aggarwal and J. Vitter, "The Input/Output Complexity of Sorting and Related Problems," Comm. ACM, vol. 31, pp. 1116-1127, 1988.

[2] A. Aho, D. Hirschberg, and J. Ullman, "Bounds on the Complexity of the Longest Common Subsequence Problem," J. ACM, vol. 23, no. 1, pp. 1-12, 1976.

[3] T. Akutsu, "Dynamic Programming Algorithms for RNA Secondary Structure Prediction with Pseudoknots," Discrete Applied Math., vol. 104, pp. 45-62, 2000.

[4] S. Altschul and B. Erickson, "Optimal Sequence Alignment Using Affine Gap Costs," Bull. Math. Biology, vol. 48, pp. 603-616, 1986.

[5] A. Apostolico, S. Browne, and C. Guerra, "Fast Linear-Space Computations of Longest Common Subsequences," Theoretical Computer Science, vol. 92, no. 1, pp. 3-17, 1992.

[6] L. Bergroth, H. Hakonen, and T. Raita, "A Survey of Longest Common Subsequence Algorithms," Proc. Seventh String Processing and Information Retrieval (SPIRE '00), pp. 39-48, 2000.

[7] J. Cannone, S. Subramanian, M. Schnare, J. Collett, L. D'Souza, Y. Du, B. Feng, N. Lin, L. Madabusi, K. Muller, N. Pande, Z. Shang, N. Yu, and R. Gutell, "The Comparative RNA Web (CRW) Site: An Online Database of Comparative Sequence and Structure Information for Ribosomal, Intron, and Other RNAs," BMC Bioinformatics, vol. 3, no. 2, http://www.rna.icmb.utexas.edu/, 2002.

[8] R. Chowdhury, "Algorithms and Data Structures for Cache-Efficient Computation: Theory and Experimental Evaluation," PhD thesis, Dept. of Computer Sciences, Univ. of Texas at Austin, 2007.

[9] R. Chowdhury, H. Le, and V. Ramachandran, "Efficient Cache-Oblivious String Algorithms for Bioinformatics," Technical Report TR-07-03, Dept. of Computer Sciences, Univ. of Texas at Austin, Feb. 2007.

[10] R. Chowdhury and V. Ramachandran, "Cache-Oblivious Dynamic Programming," Proc. 17th Ann. ACM-SIAM Symp. Discrete Algorithms (SODA '06), pp. 591-600, 2006.

[11] R. Chowdhury and V. Ramachandran, "The Cache-Oblivious Gaussian Elimination Paradigm: Theoretical Framework and Experimental Evaluation," to appear in Theory of Computing Systems (Special Issue for SPAA '07), 2010; preliminary version appeared in Proc. 19th ACM Symp. Parallelism in Algorithms and Architectures (SPAA '07), pp. 71-80, 2007.

[12] R. Chowdhury and V. Ramachandran, "Cache-Efficient Dynamic Programming Algorithms for Multicores," Proc. 20th ACM Symp. Parallelism in Algorithms and Architectures (SPAA '08), pp. 207-216, 2008.

[13] R. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran, "Oblivious Algorithms for Multicores and Network of Processors," to appear in Proc. 24th IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS '10), 2010.

[14] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, second ed. The MIT Press, 2001.

[15] M. Crochemore, G. Landau, and M. Ziv-Ukelson, "A Subquadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices," SIAM J. Computing, vol. 32, no. 6, pp. 1654-1673, 2003.

[16] T. DeSantis, I. Dubosarskiy, S. Murray, and G. Andersen, "Comprehensive Aligned Sequence Construction for Automated Design of Effective Probes (Cascade-P) Using 16S rDNA," Bioinformatics, vol. 19, pp. 1461-1468, http://greengenes.llnl.gov/16S/, 2003.

[17] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis. Cambridge Univ. Press, 1998.

[18] M. Frigo, C. Leiserson, H. Prokop, and S. Ramachandran, "Cache-Oblivious Algorithms," Proc. 40th Ann. IEEE Symp. Foundations of Computer Science (FOCS '99), pp. 285-297, 1999.

[19] M. Frigo and V. Strumpen, "Cache-Oblivious Stencil Computations," Proc. 19th ACM Int'l Conf. Supercomputing (ICS '05), pp. 361-366, 2005.

[20] O. Gotoh, "An Improved Algorithm for Matching Biological Sequences," J. Molecular Biology, vol. 162, pp. 705-708, 1982.

[21] J. Grice, R. Hughey, and D. Speck, "Reduced Space Sequence Alignment," Computer Applications in the Biosciences, vol. 13, no. 1, pp. 45-53, 1997.

[22] D. Gusfield, Algorithms on Strings, Trees and Sequences. Cambridge Univ. Press, 1997.

[23] D. Hirschberg, "A Linear Space Algorithm for Computing Maximal Common Subsequences," Comm. ACM, vol. 18, no. 6, pp. 341-343, 1975.

[24] D. Hirschberg, "An Information Theoretic Lower Bound for the Longest Common Subsequence Problem," Information Processing Letters, vol. 7, no. 1, pp. 40-41, 1978.

[25] J. Hong and H. Kung, "I/O Complexity: The Red-Blue Pebble Game," Proc. 13th Ann. ACM Symp. Theory of Computation (STOC '81), pp. 326-333, 1981.

[26] J. Kleinberg and E. Tardos, Algorithm Design. Addison-Wesley, 2005.

[27] B. Knudsen, Multiple Parsimony Alignment with "affalign", software package multalign.tar, 2008.

[28] B. Knudsen, "Optimal Multiple Parsimony Alignment with Affine Gap Cost Using a Phylogenetic Tree," Proc. Third Workshop Algorithms in Bioinformatics (WABI '03), pp. 433-446, 2003.

[29] S. Kumar and C. Rangan, "A Linear-Space Algorithm for the LCS Problem," Acta Informatica, vol. 24, pp. 353-362, 1987.

[30] H. Le, "Algorithms for Identification of Patterns in Biogeography and Median Alignment of Three Sequences in Bioinformatics," undergraduate honors thesis, Dept. of Computer Sciences, Univ. of Texas at Austin, CS-TR-06-29, 2006.

[31] R. Lyngsø and C. Pedersen, "RNA Pseudoknot Prediction in Energy-Based Models," J. Computational Biology, vol. 7, no. 3/4, pp. 409-427, 2000.

[32] D. Maier, "The Complexity of Some Problems on Subsequences and Supersequences," J. ACM, vol. 25, no. 2, pp. 322-336, 1978.



[33] W. Masek and M. Paterson, "A Faster Algorithm for Computing String Edit Distances," J. Computer and System Sciences, vol. 20, no. 1, pp. 18-31, 1980.

[34] E. Myers and W. Miller, "Optimal Alignments in Linear Space," Computer Applications in the Biosciences, vol. 4, no. 1, pp. 11-17, 1988.

[35] W. Pearson and D. Lipman, "Improved Tools for Biological Sequence Comparison," Proc. Nat'l Academy of Sciences USA, vol. 85, pp. 2444-2448, 1988.

[36] C. Pedersen, "Algorithms in Computational Biology," PhD thesis, Dept. of Computer Science, Univ. of Aarhus, 1999.

[37] D. Powell, Software Package, align3str_checkp.tar.gz, 2008.

[38] D. Powell, L. Allison, and T. Dix, "Fast, Optimal Alignment of Three Sequences Using Linear Gap Cost," J. Theoretical Biology, vol. 207, no. 3, pp. 325-336, 2000.

[39] E. Rivas and S. Eddy, "A Dynamic Programming Algorithm for RNA Structure Prediction Including Pseudoknots," J. Molecular Biology, vol. 285, no. 5, pp. 2053-2068, 1999.

[40] J. Seward and N. Nethercote, Valgrind (Debugging and Profiling Tool for x86-Linux Programs), http://valgrind.kde.org/index.html, 2008.

[41] J. Thomas et al., "Comparative Analyses of Multi-Species Sequences from Targeted Genomic Regions," Nature, vol. 424, pp. 788-793, 2003.

[42] M. Waterman, Introduction to Computational Biology. Chapman and Hall, 1995.

Rezaul Alam Chowdhury received the BS degree in computer science and engineering from Bangladesh University of Engineering and Technology and the PhD degree in computer science from the University of Texas, Austin (UT Austin), in 2007. He is currently a postdoctoral fellow in the Center for Computational Visualization, Institute for Computational Engineering and Sciences, UT Austin. His research interests include design and analysis of combinatorial algorithms and data structures, bioinformatics and computational biology, algorithms for massive data sets, cache-efficient algorithms, and parallel and multicore computing.

Hai-Son Le received the BS degree in computer sciences from the University of Texas in 2006. He is a graduate of the Turing Scholars Program. He was with Google Inc., Mountain View, California. In Fall 2008, he joined the PhD program in the Department of Machine Learning, School of Computer Science, Carnegie Mellon University, Pittsburgh. His research interests include data structures, algorithm design, computability, machine learning, and artificial intelligence. He is a student member of the ACM, the IEEE, and Phi Beta Kappa.

Vijaya Ramachandran received the PhD degree from Princeton University. She is a professor of computer science in the Department of Computer Sciences, University of Texas, Austin (UT-Austin). She was on the faculty of the University of Illinois at Urbana-Champaign prior to UT-Austin. Her research interests include theoretical computer science in the areas of algorithms and data structures, graph theory, and parallel and multicore computing. She serves on the editorial boards of the Journal of the ACM (JACM) and the ACM Transactions on Algorithms (TALG).

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.

510 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 7, NO. 3, JULY-SEPTEMBER 2010

