
Math. Comput. Sci. (2010) 3:433–442. DOI 10.1007/s11786-010-0037-2. Mathematics in Computer Science

Cache Oblivious Algorithms for the RMQ and the RMSQ Problems

Masud Hasan · Tanaeem M. Moosa · M. Sohel Rahman

Received: 6 July 2009 / Revised: 17 January 2010 / Accepted: 15 February 2010 / Published online: 13 April 2010
© Birkhäuser / Springer Basel AG 2010

Abstract In the Range Minimum/Maximum Query (RMQ) and Range Maximum-Sum Segment Query (RMSQ) problems, we are given an array which we can preprocess in order to answer subsequent queries. In an RMQ query, we are given a range on the array and we need to find the minimum/maximum element within that range. In an RMSQ query, on the other hand, we need to return the segment within the given query range that gives the maximum sum. In this paper, we present cache oblivious optimal algorithms for both of the above problems. In particular, for both problems, we present linear time data structures having optimal cache miss. The data structures can answer the corresponding queries in constant time with constant cache miss.

Keywords Algorithms · Cache oblivious algorithms · Range minima/maxima query · Range maximum-sum segment query

Mathematics Subject Classification (2000) Primary 68W32; Secondary 68R15

1 Introduction

In this paper, we study the Range Maxima/Minima Query (RMQ) and the Range Maximum-Sum Segment Query (RMSQ) problems. In the Range Minimum Query problem, we are given an array A[1..n], which

This research is part of the B.Sc. Engg. Thesis work of Tanaeem M. Moosa.

M. Hasan · T. M. Moosa · M. S. Rahman (B)
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh
e-mail: [email protected]; [email protected]

M. Hasan
e-mail: [email protected]

T. M. Moosa
e-mail: [email protected]

M. S. Rahman
Algorithm Design Group, Department of Computer Science, King’s College London, London, UK


can be preprocessed; we have to answer queries of the form RMQ_min(A, i, j), with i ≤ j, where RMQ_min(A, i, j) = min_{i≤k≤j} A[k]. The query RMQ_max(A, i, j) is defined analogously. In the RMSQ problem, on the other hand, we are given an array A[1..n] of (not necessarily positive) real numbers. Again, the array A can be preprocessed, and we have to answer queries of the form RMSQuery(A, i, j), satisfying 1 ≤ i ≤ j ≤ n, where RMSQuery(A, i, j) = (x, y), such that i ≤ x ≤ y ≤ j and the sum of the segment A[x..y] is maximized.

The RMQ problem has received much attention in the literature, not only because it is inherently beautiful from an algorithmic point of view, but also because an efficient solution to this problem often yields efficient solutions to various other algorithmic problems. For example, the RMQ problem has recently been used to devise efficient algorithms for different variants of pattern matching problems (e.g., [7,8,19]).

To the best of our knowledge, it all started with Harel and Tarjan [14], who showed how to solve another interesting problem, namely the Lowest Common Ancestor (LCA) problem¹, with linear time preprocessing and constant time queries. It turns out that solving the RMQ problem is equivalent to solving the LCA problem [13]. The Harel–Tarjan algorithm was simplified by Schieber and Vishkin [22] and then by Berkman et al. [4], who presented optimal work parallel algorithms for the LCA problem. Finally, Bender and Farach-Colton eliminated the parallelism mechanism and presented a simple serial data structure [2]. Notably, the data structure in [2] requires O(n log n) bits of space. Recently, Sadakane [21] presented a succinct data structure which achieves the same time complexity using O(n) bits of space. In all of the above papers, there is an interplay between the LCA and the RMQ problems. Very recently, Fischer and Heun [15] presented the first algorithm for the RMQ problem with linear preprocessing time, optimal 2n + o(n) bits of additional space, and constant query time that makes no use of the LCA algorithm.

On the other hand, the RMSQ problem has its origin in the problem of finding the maximum-sum segment of a given number sequence, which plays an important role in sequence analysis. Bentley’s linear-time algorithm [3] for finding such a segment is by now a folklore example considered in algorithm classes. There are many variants of the maximum-sum segment problem that impose extra constraints on the input or on the output [1,5,11,16–18,24]. The RMSQ problem was studied very recently by Chen and Chao in [6], where they showed that the RMSQ problem is linearly equivalent to the RMQ problem.

The goal of this paper is to study these two interesting problems in the Cache Oblivious (CO) model. There has not been any work in the literature on RMQ under the CO model except for a very recent work of Demaine et al. [10]². A comparison of our result with the work of [10] is provided in Sect. 3.1. On the other hand, to the best of our knowledge, this is the first attempt to study the RMSQ problem under the CO model.

In what follows, we will use the following convention introduced in [2]. If a solution to the RMQ (or RMSQ) problem has preprocessing time (or cache miss) f(n) and query time (or query cache miss) g(n), then we will say that the algorithm has time complexity (cache complexity) 〈f(n), g(n)〉. Our contribution in this paper is twofold.

1. Firstly, we present an efficient cache oblivious algorithm for the RMQ problem. In particular, we present an 〈O(n), O(1)〉 algorithm for the RMQ problem which has 〈O(n/B), O(1)〉 cache miss. We achieve the above result by doing the following:

• We first consider an 〈O(n log n), O(1)〉 algorithm to solve the RMQ problem, which is referred to as the Sparse Table (ST) algorithm in the literature. The ST algorithm is used as a basic block in the 〈O(n), O(1)〉 algorithm of Fischer and Heun [15] (referred to as the FH Algorithm henceforth). We first show how the ST algorithm can be implemented efficiently (〈O((n/B) log n), O(1)〉 cache miss) in the Cache Oblivious model.
• Then, we consider the FH algorithm. We first show that the FH algorithm is not cache efficient and identify the components responsible (apart from the ST algorithm). Then we show how to make those components cache efficient by doing some non-trivial modifications.

¹ In the LCA problem the query gives two nodes in a given tree and requests the lowest common ancestor of the two nodes.
² We came to know about the work of [10] during the preparation of this manuscript, when it was yet to be presented at ICALP 2009.


2. Our second contribution is a cache oblivious efficient algorithm for the RMSQ problem. In particular, we will modify the existing 〈O(n), O(1)〉 algorithm of Chen and Chao [6] (referred to as the C2 Algorithm henceforth) to make it cache oblivious.

2 Preliminaries

The cache-oblivious (CO) model is a generalization of the External Memory (EM) model [23], which approximates the memory hierarchy by modeling two levels, with the inner level having size M, the outer level having infinite size, and transfers between the levels taking place in blocks of B consecutive elements. The cost measure of an algorithm is the number of memory transfers, or I/Os, it makes. The CO model, introduced by Frigo et al. [12], elegantly generalizes the EM model to a multi-level memory model by a simple measure: the algorithm is not allowed to know the values of B and M. Cache replacement is assumed to take place automatically by an optimal off-line cache replacement strategy, and the cache is assumed to be fully associative.³ Since the analysis holds for any B and M, it holds for all levels simultaneously. Readers are referred to [12] for the full details of the cache-oblivious model.

We will utilize the Tall Cache Assumption, a frequently used assumption (usually true in practice) of the CO model, which states that M = Ω(B²). We will also need the following results.

Lemma 1 ([9]) In the CO model, linear traversal of an N-element array requires optimal O(N/B) cache misses, and simultaneous traversal of k arrays, each of size at most N, requires optimal O(kN/B) cache misses as long as M ≥ kB. Additionally, in this model, a stack, implemented with an array and a top pointer, has optimal O(N/B) cache misses, where N is the number of operations (push, pop) on the stack.

3 Cache Oblivious RMQ

We first discuss an 〈O(n log n), O(1)〉 algorithm for solving the RMQ problem, which is referred to as the ST (sparse table) algorithm henceforth. The ST algorithm is used as a building block in a number of solutions to the RMQ problem in the literature. The basic idea of the ST algorithm is to pre-compute answers for each of the ranges whose length is a power of two. For all 1 ≤ i ≤ n and for all 0 ≤ j ≤ ⌊log n⌋, the position of the minimum value of A[i .. i + 2^j − 1] is precomputed and kept in an array MS[j][i]. The array needs O(n log n) space, and can be filled in O(n log n) time using the following recurrence.

MS[j][i] =
  i                       if j = 0,
  MS[j−1][i]              if A[MS[j−1][i]] ≤ A[MS[j−1][i + 2^{j−1}]],
  MS[j−1][i + 2^{j−1}]    otherwise.        (1)

This array can be filled sequentially, and by keeping an additional array which stores the minimum value of the range A[i .. i + 2^j − 1], we can make sure that all other memory accesses are also sequential. Therefore, the cache miss for this algorithm will be optimal O((total memory accessed)/B) = O((n log n)/B). To answer a query RMQ(A, i, j) in O(1), we can select two possibly overlapping segments that exactly cover the range [i, j], and return the position where the smaller value occurs. More formally, RMQ(A, i, j) = min_{k ∈ {MS[l][i], MS[l][j − 2^l + 1]}} A[k], where l = ⌊log(j − i)⌋. The time complexity and cache miss for a query are O(1).
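The ST scheme just described can be sketched as follows in Python. This is a minimal illustration with 0-based indexing and our own names, not the authors' code:

```python
from math import floor, log2

def st_preprocess(A):
    """Sparse Table preprocessing: MS[j][i] holds the position of the
    minimum of A[i .. i + 2^j - 1] (0-based indices here)."""
    n = len(A)
    K = floor(log2(n)) if n > 1 else 0
    MS = [list(range(n))]                 # j = 0: each element is its own minimum
    for j in range(1, K + 1):
        half = 1 << (j - 1)
        prev, row = MS[j - 1], []
        for i in range(n - (1 << j) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if A[a] <= A[b] else b)
        MS.append(row)
    return MS

def st_query(A, MS, i, j):
    """Position of the minimum of A[i..j] (inclusive) in O(1): combine the
    two power-of-two windows starting at i and ending at j."""
    if i == j:
        return i
    l = floor(log2(j - i))
    a, b = MS[l][i], MS[l][j - (1 << l) + 1]
    return a if A[a] <= A[b] else b
```

For instance, `st_query(A, MS, 2, 4)` inspects the two overlapping windows [2, 3] and [3, 4] of length 2^1.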

Now we briefly review the algorithm by Fischer and Heun [15] (referred to as the FH algorithm henceforth). Given an array A[l..r], the Cartesian tree C(A[l..r]) is a binary tree whose root is A[i], the minimum⁴ element of A[l..r]. If i > l (resp. i < r), the left (resp. right) subtree of this root is C(A[l..i−1]) (resp. C(A[i+1..r]));

³ Justification of the above assumptions can be found in [9,12,20]: the basic idea is that if the cache miss of an algorithm under these assumptions is f(n), then for a real cache it will be O(f(n)).
⁴ To ensure uniqueness, if there are equal elements, we consider the one with the smaller index to be smaller.


Algorithm 1: An algorithm to compute the type of a block [15]
Input: block B_j of size s
Output: type(B_j)
1.1  let rp be an array of size s + 1;
1.2  rp[1] ← −∞;
1.3  q ← s, N ← 0;
1.4  for i ← 1 to s do
1.5      while rp[q + i − s] > B_j[i] do
1.6          N ← N + C_{(s−i)q};
1.7          q ← q − 1;
1.8      end
1.9      rp[q + i + 1 − s] ← B_j[i];
1.10 end
1.11 return N
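For concreteness, a direct Python transcription of Algorithm 1 might look like the following. The `ballot` helper and the 0-based indexing are our own; `ballot` computes the ballot numbers C_{pq} from their closed form:

```python
from math import comb, inf

def ballot(p, q):
    # Closed form of the ballot numbers: C_pq = ((q - p + 1)/(q + 1)) * binom(p + q, q)
    if 0 <= p <= q:
        return (q - p + 1) * comb(p + q, q) // (q + 1)
    return 0

def block_type(block):
    """Type of a block (Algorithm 1): two blocks get the same type iff
    they correspond to the same Cartesian tree."""
    s = len(block)
    rp = [-inf] * (s + 1)                 # rp works as a stack; rp[0] is the sentinel rp[1] of the paper
    q, N = s, 0
    for i in range(1, s + 1):
        while rp[q + i - s - 1] > block[i - 1]:   # rp[q + i - s] in the paper's 1-based indexing
            N += ballot(s - i, q)
            q -= 1
        rp[q + i - s] = block[i - 1]              # rp[q + i + 1 - s] in 1-based indexing
    return N
```

The returned integer is a rank among the C_s possible Cartesian-tree shapes of an s-element block.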

otherwise there is no left (resp. right) subtree. Clearly, the Cartesian tree can be defined for maximum elements analogously. Now, the FH algorithm utilizes the fact that two arrays will have the same answers for the RMQ queries if they correspond to the same Cartesian tree. The FH algorithm first divides the array A into blocks B1, B2, …, B_{n/s} of size s. For each block, the minimum of that block is computed in an array A′[1..n/s], and the position of this minimum is kept in an array B[1..n/s]. Then, the ST algorithm is used to preprocess the array A′. A type is calculated (using Algorithm 1 [15]) for each block, such that the types of two blocks will be the same iff they correspond to the same Cartesian tree. The RMQ results for all possible ranges for each different type⁵ are preprocessed and kept in the array P. Since the number of binary trees with n nodes is C_n, the n-th Catalan number, FH needs to precompute at most C_s, i.e., O(4^s/s^{3/2}), different blocks. Since O(s²) time and memory is needed for each precomputation, the total time and space complexity would be O(4^s √s). And, since the preprocessing time of the ST algorithm is O(n log n), the total precomputation cost is O(4^s √s + (n/s) log n). If s = (log n)/4, then the time complexity will be O(4^{(log n)/4} √((log n)/4) + (n/((log n)/4)) log n), or O(√n √((log n)/4) + 4n), which is O(n). Now any query RMQ(A, i, j) can be answered using at most three sub-queries, namely, two in-block queries and one block query.

In [15], Fischer and Heun did not analyze the algorithm under the CO model. In what follows, we present a detailed analysis of the cache complexity of the FH algorithm. In particular, we will highlight the steps of the FH algorithm which are responsible for its suboptimal performance under the CO model. In the sequel, we will show how to modify these steps to achieve cache obliviousness. Now we analyze the cache miss of the FH algorithm. As the query needs O(1) time, it also needs O(1) cache miss. For preprocessing, we consider each step separately.

• Partitioning into blocks, calculating the minimum of each block, and using the ST preprocessing algorithm on A′ requires O(n/B) cache miss, since all of these require only linear memory accesses.

• To calculate the type of a block, Algorithm 1 uses the ballot numbers C_{pq}, defined by the following recurrence.

C_{00} = 1; C_{pq} = C_{(p−1)q} + C_{p(q−1)} if 0 ≤ p ≤ q ≠ 0; and C_{pq} = 0 otherwise. (2)

It can be shown that C_{pq} is equal to ((q − p + 1)/(q + 1)) times the binomial coefficient (p+q choose q), implying that C_{ss} is the s-th Catalan number. In the type calculation, the rp array (Algorithm 1) works as a stack; hence the cache miss due to the rp array is O(n/B). The array B_j is also accessed linearly; hence, the cache miss from this array is also O(n/B). But the values of C_{(s−i)q} are not accessed linearly. It can be shown that the cache miss for accessing this table is O(n/√B) in the worst case.
• The cache miss for the precomputation step can be O(n/log n) in the worst case, as follows. After calculating the type of a block, we have to check whether the result for this type has already been calculated. As the values of the types can come in random order, each of these checks may require O(1) cache miss. Since these checks are done O(n/log n) times, in total the cache miss will be O(n/log n).

Now we show how to achieve the optimal O(n/B) cache miss by modifying the FH algorithm. Recall that the FH algorithm fails to achieve the optimal O(n/B) cache miss firstly due to its requirement to access C_{pq} for type

⁵ Note that several blocks may belong to the same type.


Algorithm 2: Cache oblivious RMQ preprocessing
Input: An array A[1..n]
Result: Arrays P[0..C_s][1..s][1..s] and T[1..n/s], and the ST-RMQ precomputation of the minimum elements of each block
2.1  s ← (log n)/4;
2.2  let rp be an array of size s + 1;
2.3  let v be an array of size n/s + 1;
2.4  v[0] ← {−∞, −1};
2.5  for i ← 1 to n/s do
2.6      rp[1] ← −∞;
2.7      q ← s, N ← 0;
2.8      C ← s-th Catalan number;  /* calculated using C_s = (1/(s+1)) (2s choose s) */
2.9      for j ← 1 to s do
2.10         k ← (i − 1)s + j;
2.11         while rp[q + j − s] > A[k] do
2.12             N ← N + C;
2.13             C ← f2(C, s − j, q);
2.14             q ← q − 1;
2.15         end
2.16         rp[q + j + 1 − s] ← A[k];
2.17         C ← f1(C, s − j, q);
2.18     end
2.19     T[i] ← N;
2.20     v[i] ← {N, i};
2.21     calculate the minimum of A[(i − 1)s + 1 .. i·s] and store it in B[i];
2.22 end
2.23 Preprocess B[1..n/s] for RMQ using the ST pre-processing algorithm;
2.24 Sort v[1..n/s] by first element;
2.25 for i ← 1 to n/s do
2.26     {pr, j1} ← v[i − 1];
2.27     {cur, j} ← v[i];
2.28     if pr ≠ cur then
2.29         precalculate the result of RMQ(i1, i2) for all 1 ≤ i1 ≤ i2 ≤ s for the block A[(j − 1)s + 1 .. j·s], and store it in P[cur][i1][i2];
2.30     end
2.31 end

calculation, and subsequently because of the required check to see whether this type has already occurred (and hence the calculation is redundant). Therefore, we need to modify these two steps. The modified complete preprocessing algorithm is presented in Algorithm 2. To calculate the type of a block, our algorithm uses two functions f1 and f2, defined below:

f1(C, p, q) = ((q − p + 2) p / ((q − p + 1)(p + q))) · C    (3)

f2(C, p, q) = ((q − p)(q + 1) / ((q − p + 1)(p + q))) · C    (4)

It is easy to see that C_{(p−1)q} = f1(C_{pq}, p, q) and C_{p(q−1)} = f2(C_{pq}, p, q). We now prove that our algorithm for type calculation is correct.
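These update rules can be checked numerically against the closed form of the ballot numbers. The sketch below uses our own helper names; integer division is exact because each result is itself a ballot number:

```python
from math import comb

def ballot(p, q):
    # Closed form: C_pq = ((q - p + 1)/(q + 1)) * binom(p + q, q)
    if 0 <= p <= q:
        return (q - p + 1) * comb(p + q, q) // (q + 1)
    return 0

def f1(C, p, q):
    # Eq. (3): maps C_pq to C_{(p-1)q}
    return C * (q - p + 2) * p // ((q - p + 1) * (p + q))

def f2(C, p, q):
    # Eq. (4): maps C_pq to C_{p(q-1)}
    return C * (q - p) * (q + 1) // ((q - p + 1) * (p + q))
```

For example, `f1(ballot(3, 5), 3, 5)` equals `ballot(2, 5)`, so a single running variable C can replace the random-access table of ballot numbers.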

Lemma 2 The type of a block calculated by Algorithm 2 is correct.

Proof We show that the type calculated by Algorithm 2 is exactly the same as that calculated by Algorithm 1. Note that, instead of directly using C_{(s−i)q}, as is done in line 1.6 of Algorithm 1, in our algorithm we have used a


variable C. Initially, C_{ss} = C_{(s−1)s}. C is changed whenever the values of j and q are changed, and it is calculated using the functions f1 and f2 to ensure that the value of C at line 2.12 of Algorithm 2 is C_{(s−j)q} (the counterpart of C_{(s−i)q} in Algorithm 1, whose loop variable is i). Hence the result follows. □

We store the types and the positions of the occurrences of each type in an array v, and sort v by type. This ensures that the access by type will be linear. However, the access of the array A to calculate all possible answers will no longer be linear. But we will prove that, if the tall cache assumption holds, this memory access will still cost O(n/B) cache miss.
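The sort-then-scan idea can be sketched as follows. This is our own illustration, not the paper's code: for brevity it encodes a block's type by a pop-count signature of the Cartesian-tree stack construction rather than by the ballot-number encoding of Algorithm 1; any encoding that coincides exactly on equal Cartesian trees serves the purpose here.

```python
def block_type(block):
    # Pop-count signature of the stack-based Cartesian tree construction:
    # equal signatures iff equal Cartesian trees.
    stack, sig = [], []
    for x in block:
        pops = 0
        while stack and stack[-1] > x:
            stack.pop()
            pops += 1
        sig.append(pops)
        stack.append(x)
    return tuple(sig)

def preprocess_blocks(A, s):
    """Second loop of Algorithm 2, sketched: sort (type, block) pairs so each
    in-block table is built once, at the first occurrence of its type.
    P maps type -> s x s table of in-block RMQ answers (block-local, 0-based)."""
    nblocks = len(A) // s
    v = sorted((block_type(A[b*s:(b+1)*s]), b) for b in range(nblocks))
    P = {}
    for t, b in v:
        if t in P:                        # type seen before: table already built
            continue
        block = A[b*s:(b+1)*s]
        tbl = [[0] * s for _ in range(s)]
        for i in range(s):
            best = i
            for j in range(i, s):
                if block[j] < block[best]:
                    best = j
                tbl[i][j] = best
        P[t] = tbl
    return P
```

Sorting by type makes the writes to P sequential; only the reads of A for first-occurrence blocks remain non-linear, which is the access the theorem below bounds.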

Theorem 3 Algorithm 2 is correct, runs in O(n) time, and costs O(n/B) cache miss in the cache oblivious model under the tall cache assumption.

Proof The correctness of Algorithm 2 follows from Lemma 2 and the correctness of FH algorithm [15]. Below, wededuce the time complexity and cache complexity of the algorithm.

Time Complexity: Consider an iteration of the first for loop at Line 2.5. The calculation of C can be done in O(s). The array rp works like a stack; each of the s elements is pushed once onto the stack and popped at most once, so the total stack operation is O(s). Since the number of iterations is n/s, the time complexity of the first for loop (starting at Line 2.5) is O(n). The RMQ preprocessing with the ST algorithm for B has time complexity O((n/s) log(n/s)), which is O(n) since s = (log n)/4. In the same way, sorting the array v of n/s elements can also be done in O(n). For the second for loop (starting at Line 2.25), the if condition (at Line 2.28) is satisfied at most C_s times, as there can be C_s different values of type. The calculation of Line 2.29 can be done in O(s²). So the time complexity of this for loop is O(C_s s²) = O(4^s √s) = O(√(n log n)).

Cache Complexity: In the first for loop (starting at Line 2.5) four arrays, namely A, rp, T and v, are accessed. Now, A, T and v are accessed sequentially, while rp acts as a stack. Therefore, the cache miss for this loop is O(n/B). The cache complexity for the RMQ preprocessing of array B using the ST algorithm is O((n/(sB)) log(n/s)) = O(n/B). The cache complexity of sorting is O((n/(sB)) log₂(n/(sB))) if merge sort is used, while it will be O((n/(sB)) log_{M/B}(n/(sB))) if cache oblivious sorting is used. Since s = (log n)/4, in both cases the cache complexity would be O(n/B). Now, for the second loop (starting at Line 2.25), the array v is linearly accessed, so the cache miss excluding the if condition (at Line 2.28) is O((n/s)/B) = O(n/B). The if condition is satisfied at most C_s = O(4^s/s^{3/2}) times. Inside the if condition, the P and A arrays are accessed. The P array is linearly accessed since v is sorted, so the cache miss for P is O((size of P)/B) = o(n/B), since the size of P is O(√(n log n)). Now, as mentioned before, the access of the array A is not linear in our algorithm. To count the cache miss due to the access of A, we need to consider two cases.

Case 1: B ≤ s. In this case, each time the if condition is satisfied, s elements of A are accessed linearly, which amounts to a cache miss of O(s/B). So the total cache miss is O(4^s/(s^{1/2} B)) = O(√(n/log n)/B) = o(n/B).

Case 2: B > s. In this case, there will not be more than one cache miss each time the if condition is satisfied. So we have a cache miss of O(4^s/s^{3/2}) = O(√n/log^{3/2} n). It remains to show that O(√n/log^{3/2} n) = O(n/B). Recall that, by the tall cache assumption, we have M = Ω(B²). Now we have two cases as follows.

Case 2a: M = Ω(n log n). As the total memory used by our algorithm is O(n), no cache replacement will be necessary, and hence we will have just one cache miss per memory block. Therefore, the total cache miss will be (total memory used)/B = O(n/B).

Case 2b: M = o(n log n). As B² = O(M), we have B = o(√(n log n)). So O(n/B) dominates the O(√n/log^{3/2} n) term.

So, either way, the cache miss is O(n/B) and hence the result follows. □


Algorithm 3: Compute CPM [6]
Input: An array A of size n
Result: An array C of length n + 1 and two arrays P and M of length n
3.1  C[0] ← 0;
3.2  for i ← 1 to n do
3.3      C[i] ← C[i − 1] + A[i];
3.4      L[i] ← i − 1;
3.5      P[i] ← i;
3.6      while C[L[i]] < C[i] and L[i] > 0 do
3.7          if C[P[L[i]] − 1] < C[P[i] − 1] then P[i] ← P[L[i]];
3.8          L[i] ← L[L[i]];
3.9      end
3.10     M[i] ← C[i] − C[P[i] − 1];
3.11 end
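A direct Python transcription of Algorithm 3 might read as follows. The 1-based arrays of the paper are simulated with a padding element at index 0, and the names are ours:

```python
def compute_cpm(A):
    """C2 preprocessing (Algorithm 3). A is 1-based (A[0] unused).
    C is the cumulative-sum array; P[i] is the start of the best segment
    ending at i under the C2 rule, and M[i] = S(P[i], i) is its sum."""
    n = len(A) - 1
    C = [0] * (n + 1)
    L = [0] * (n + 1)
    P = [0] * (n + 1)
    M = [0] * (n + 1)
    for i in range(1, n + 1):
        C[i] = C[i - 1] + A[i]
        L[i] = i - 1
        P[i] = i
        while L[i] > 0 and C[L[i]] < C[i]:
            if C[P[L[i]] - 1] < C[P[i] - 1]:
                P[i] = P[L[i]]      # a better (smaller) prefix sum to subtract
            L[i] = L[L[i]]          # jump left along the L links
        M[i] = C[i] - C[P[i] - 1]
    return C, P, M
```

The while loop is the culprit for the non-linear memory access discussed below: following `L[L[i]]` jumps backwards through C, P and L in an unpredictable pattern.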

3.1 Comparison with Demaine et al.’s [10] Work

Very recently, in [10], Demaine et al. presented a cache oblivious data structure for the RMQ problem. The authors in [10] also followed a similar strategy and modified the FH algorithm to achieve their result. Recall that the main reason for the non-optimal cache miss of the FH algorithm can be attributed to the construction of the lookup table P. As the values of the types can occur randomly, checking whether the pre-computation of the corresponding block is necessary may cause non-optimal cache miss. To avoid this, we have sorted the types and accessed P linearly. However, this strategy may cause random access of the array A; but we have proved that the cache miss caused by this non-linear access remains o(n/B). Demaine et al. employed a different technique to achieve overall optimal cache miss. They avoided random access by changing the encoding to a binary string so that they can generate a representative array for an encoding. We believe that our approach is much simpler, as we don’t change the encoding and don’t need to calculate any unnecessary lookup table entries. We further believe that our technique of replacing a costly (with respect to cache miss) non-linear memory access with a less costly non-linear memory access may turn out to be useful in other problems as a novel technique of achieving cache obliviousness.

4 Cache Oblivious RMSQ

In this section, using the cache oblivious optimal algorithm for RMQ, we modify the algorithm of Chen and Chao [6] (referred to as the C2 algorithm henceforth) to get an optimal cache oblivious algorithm for the RMSQ problem. For our convenience, we will denote the sum of a segment A[i..j] by S(i, j). First, we review the 〈O(n), O(1)〉 solution to the RMSQ problem, i.e., the C2 algorithm [6], and analyze its cache miss. The C2 preprocessing algorithm first calculates the cumulative sum array C. Then, for each index i, it finds an index p ≤ i such that for all p ≤ j < i we have C[j] < C[i], and the sum of the segment S(p, i) is maximized. The value of p is stored in an array P, and the sum S(p, i) is stored in the array M. Then the algorithm preprocesses M for RMQ_max and C for RMQ_min. The algorithm for computing the C, P and M arrays is given in Algorithm 3. To answer a query RMSQuery(A, i, j), the C2 algorithm first finds k = RMQ_max(M, i, j). If P[k] ≥ i, it returns (P[k], k). If, on the other hand, P[k] < i, it computes k1 = RMQ_max(M, k + 1, j) and k2 = RMQ_min(C, i − 1, k − 1) + 1, and between the two ranges, [k2..k] and [P[k1]..k1], it returns the one that maximizes the segment sum. For a detailed discussion of the C2 algorithm (both preprocessing and query) and its correctness, we refer the readers to [6].
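The query procedure just described can be sketched as follows. For clarity this illustration (our own naming) uses naive linear-scan RMQs in place of the 〈O(n), O(1)〉 structures, and adds a small guard for the boundary case k = j, where the range of the k1 sub-query would be empty; the prose above elides that case.

```python
def rmsq(C, P, M, i, j):
    """C2 query sketch on precomputed C, P, M (1-based, padded at index 0).
    Returns (x, y), the maximum-sum segment with i <= x <= y <= j."""
    def rmq_max(X, a, b):               # position of the maximum of X[a..b]
        return max(range(a, b + 1), key=lambda t: X[t])
    def rmq_min(X, a, b):               # position of the minimum of X[a..b]
        return min(range(a, b + 1), key=lambda t: X[t])

    k = rmq_max(M, i, j)
    if P[k] >= i:                       # best segment ending at k starts inside [i, j]
        return (P[k], k)
    # P[k] < i: clip the left candidate and consider the best segment right of k
    k2 = rmq_min(C, i - 1, k - 1) + 1
    if k == j:                          # guard: no room for the k1 sub-query
        return (k2, k)
    k1 = rmq_max(M, k + 1, j)
    if C[k] - C[k2 - 1] >= M[k1]:       # S(k2, k) vs. S(P[k1], k1) = M[k1]
        return (k2, k)
    return (P[k1], k1)
```

With the arrays for A = [2, −3, 4, −1, 2] (1-based), `rmsq(C, P, M, 1, 5)` returns (3, 5), the segment 4 − 1 + 2 = 5.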

We now analyze the cache miss for this algorithm. As the query is O(1), the cache miss will also be O(1). The preprocessing algorithm involves two RMQ preprocessing steps (M is preprocessed for RMQ_max and C is preprocessed for RMQ_min). Now, if we use our cache oblivious algorithm for the RMQ (both max and min) preprocessing, the cache miss for these preprocessing steps would be bounded by O(n/B) (Theorem 3). However, overall, Algorithm 3 may still require O(n) cache miss in the worst case, because the memory access is not linear in the while loop


Algorithm 4: Cache Oblivious Optimal Algorithm to Compute CPM
Input: An array A of size n
Result: An array C of length n + 1 and two arrays P and M of length n
4.1  let lst, pst, cst and ccst be arrays of size n + 1;
4.2  top ← 0;
4.3  cst[0] ← 0;
4.4  ccst[0] ← 0;
4.5  lst[0] ← 0;
4.6  pst[0] ← 0;
4.7  C[0] ← 0;
4.8  for i ← 1 to n do
4.9      C[i] ← C[i − 1] + A[i];
4.10     L[i] ← i − 1;
4.11     P[i] ← i;
4.12     cc ← C[i − 1];
4.13     while cst[top] < C[i] and L[i] > 0 do
4.14         if ccst[top] < cc then
4.15             P[i] ← pst[top];
4.16             cc ← ccst[top];
4.17         end
4.18         L[i] ← lst[top];
4.19         top ← top − 1;
4.20     end
4.21     M[i] ← C[i] − cc;
4.22     top ← top + 1;
4.23     pst[top] ← P[i];
4.24     cst[top] ← C[i];
4.25     lst[top] ← L[i];
4.26     ccst[top] ← cc;
4.27 end
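A Python sketch of Algorithm 4 follows (our own naming). The four parallel arrays lst, pst, cst and ccst are kept as Python lists acting as one stack of (L, P, C, C[P−1]) records, so every access is either sequential or a push/pop at the top:

```python
def compute_cpm_stack(A):
    """Stack-based, cache-friendly CPM computation (Algorithm 4).
    A is 1-based (A[0] unused); returns C, P, M as in the C2 algorithm."""
    n = len(A) - 1
    C = [0] * (n + 1)
    L = [0] * (n + 1)
    P = [0] * (n + 1)
    M = [0] * (n + 1)
    # one logical stack, initialized with the sentinel record (0, 0, 0, 0)
    lst, pst, cst, ccst = [0], [0], [0], [0]
    for i in range(1, n + 1):
        C[i] = C[i - 1] + A[i]
        L[i] = i - 1
        P[i] = i
        cc = C[i - 1]                       # current value of C[P[i] - 1]
        while cst[-1] < C[i] and L[i] > 0:
            if ccst[-1] < cc:               # popped record offers a smaller C[P-1]
                P[i] = pst[-1]
                cc = ccst[-1]
            L[i] = lst[-1]
            lst.pop(); pst.pop(); cst.pop(); ccst.pop()
        M[i] = C[i] - cc
        lst.append(L[i]); pst.append(P[i]); cst.append(C[i]); ccst.append(cc)
    return C, P, M
```

On any input this produces the same C, P and M as the direct transcription of Algorithm 3, but with only linear and stack accesses.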

(at Line 3.6). In what follows, we show how to overcome this problem and achieve optimal cache miss. We present the modified cache oblivious optimal algorithm to compute the arrays C, P and M in Algorithm 4. To prove its correctness, we first concentrate on the array L. Note that, in the C2 algorithm, L[i] is used to keep the maximum index j such that j < i and C[j] > C[i]; if there does not exist any index satisfying these conditions, then L[i] = 0. We first prove that the values of L calculated by Algorithm 4 are correct, i.e., are identical to the values calculated by Algorithm 3.

Lemma 4 The value of L calculated by Algorithm 4 is correct.

Proof First, note that the arrays cst, ccst, lst and pst work as stacks, with top acting as their top pointer. Now we claim the following: at the start of the i-th iteration of the for loop (Line 4.8), for each element k in lst[1..top], either k ∈ {0, i−1} or C[k] > C[i−1], and the corresponding position of cst stores C[k]. Clearly, the claim is true before the start of the for loop. Now, at the i-th iteration of the for loop, if the condition is not satisfied, we remove the corresponding elements from lst and insert the current element into lst. Note that we don’t remove any element from lst that satisfies the condition. So, when we pair up i, all potential candidates are in the lst array. Now, i will either be paired up with i−1 or with some k. In the latter case, C[i] > C[i−1], and hence k must be either 0 or such that C[k] > C[i] > C[i−1]. Now, note that all such elements are in lst, and we choose L[i] to be the first element k which is either 0 or such that C[k] > C[i]. Hence the result follows. □

Now we have the following theorem.

Theorem 5 Algorithm 4 correctly computes the arrays C,P and M, runs in O(n) time and has O(n/B) cachemisses.


Proof The correctness of the computation of C is trivial. Since the L array is correctly calculated (Lemma 4), it suffices to show that for each segment, P and M are correctly computed. Now, note that, for all k in lst[1..top], the corresponding position of pst keeps the value of P[k] and ccst keeps the value of C[P[k] − 1]. Now, P[i] actually stores the index k such that C[k − 1] is minimum among all possible k satisfying L[i] ≤ k ≤ i. While calculating L[i], each time we pop an element from the stack, we expand the range by adding another range. So P[i] can be updated to the correct value by comparing the current value and the value from that range, taking the one which minimizes C[P[i] − 1]. This is exactly what is done in Line 4.14. So the value of P is correctly calculated. Now, note that we store C[P[i] − 1] in ccst, and whenever we change P[i], we update cc by the corresponding value from ccst. Therefore, the value calculated at M is correct as well.

Now we prove the time and cache complexity. We note that each element is pushed onto the stack exactly once and popped at most once. So the time complexity of this algorithm is O(n). Since all memory accesses are either linear or stack-based, the cache miss is optimal O(n/B). □

5 Conclusion

In this paper, we have presented cache oblivious optimal algorithms for the RMQ and RMSQ problems. For the RMQ problem, we have presented an 〈O(n), O(1)〉 algorithm which has 〈O(n/B), O(1)〉 cache misses. We achieve the above result by doing some non-trivial modifications to the RMQ algorithm of Fischer and Heun [15]. For the RMSQ problem, we have modified the algorithm of Chen and Chao [6] to make it cache efficient. Notably, in both algorithms the cache miss for the query is O(1). One interesting avenue for future research could be to focus on the amortized cache miss per query. While a better amortized cache miss for the RMQ query seems unlikely, for restricted cases this might be possible. As an example, the constant length query with linearly increasing starting index can be cited. This special case has an application in the problem of Finding the Longest Segment Satisfying an Average Constraint. One solution to this problem makes O(n) RMQ queries. In that case, our algorithm has O(1/log n) amortized cache misses. An interesting question is whether O(1/B) amortized cache miss is possible in this case.

References

1. Allison, L.: Longest biased interval and longest non-negative sum interval. Bioinformatics 19(10), 1294–1295 (2003)
2. Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Gonnet, G.H., Panario, D., Viola, A. (eds.) LATIN, pp. 88–94 (2000)
3. Bentley, J.L.: Algorithm design techniques. Commun. ACM 27(9), 865–871 (1984)
4. Berkman, O., Breslauer, D., Galil, Z., Schieber, B., Vishkin, U.: Highly parallelizable problems (extended abstract). In: STOC, pp. 309–319. ACM, New York (1989)
5. Chen, K.-Y., Chao, K.-M.: Optimal algorithms for locating the longest and shortest segments satisfying a sum or an average constraint. Inf. Process. Lett. 96(6), 197–201 (2005)
6. Chen, K.-Y., Chao, K.-M.: On the range maximum-sum segment query problem. Discrete Appl. Math. 155(16), 2043–2052 (2007)
7. Crochemore, M., Iliopoulos, C.S., Rahman, M.S.: Finding patterns in given intervals. In: Kucera, L., Kucera, A. (eds.) MFCS. Lecture Notes in Computer Science, vol. 4708, pp. 645–656. Springer, Berlin (2007)
8. Crochemore, M., Iliopoulos, C.S., Rahman, M.S.: Optimal prefix and suffix queries on texts. In: Jacquet, P. (ed.) AofA, DMTCS Proc., AH, pp. 645–656 (2007)
9. Demaine, E.: Cache-oblivious algorithms and data structures. In: Lecture Notes from the EEF Summer School on Massive Data Sets (2002)
10. Demaine, E.D., Landau, G.M., Weimann, O.: On Cartesian trees and range minimum queries. In: Albers, S., Marchetti-Spaccamela, A., Matias, Y., Nikoletseas, S.E., Thomas, W. (eds.) ICALP (1). Lecture Notes in Computer Science, vol. 5555, pp. 341–353. Springer, Berlin (2009)
11. Fan, T.-H., Lee, S., Lu, H.-I., Tsou, T.-S., Wang, T.-C., Yao, A.: An optimal algorithm for maximum-sum segment and its application in bioinformatics (extended abstract). In: Ibarra, O.H., Dang, Z. (eds.) CIAA. Lecture Notes in Computer Science, vol. 2759, pp. 251–257. Springer, Berlin (2003)
12. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: 40th Annual Symposium on Foundations of Computer Science (FOCS ’99), IEEE, Washington, Brussels, Tokyo, Oct. 1999
13. Gabow, H., Bentley, J., Tarjan, R.: Scaling and related techniques for geometry problems. In: Symposium on the Theory of Computing (STOC), pp. 135–143. ACM, New York (1984)
14. Harel, D., Tarjan, R.E.: Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13(2), 338–355 (1984)
15. Fischer, J., Heun, V.: A new succinct representation of RMQ-information and improvements in the enhanced suffix array. In: Chen, B., Zhang, G. (eds.) ESCAPE. Lecture Notes in Computer Science, vol. 4614, pp. 459–470. Springer, Berlin (2007)
16. Lin, Y.-L., Huang, X., Jiang, T., Chao, K.-M.: MAVG: locating non-overlapping maximum average segments in a given sequence. Bioinformatics 19(1), 151–152 (2003)
17. Lin, Y.-L., Jiang, T., Chao, K.-M.: Efficient algorithms for locating the length-constrained heaviest segments with applications to biomolecular sequence analysis. J. Comput. Syst. Sci. 65(3), 570–586 (2002)
18. Chung, K.-M., Lu, H.-I.: An optimal algorithm for the maximum-density segment problem. SIAM J. Comput. 34(2), 373–387 (2004)
19. Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 657–666 (2002)
20. Prokop, H.: Cache-oblivious algorithms. Master’s thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science (1999)
21. Sadakane, K.: Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms 5(1), 12–22 (2007)
22. Schieber, B., Vishkin, U.: On finding lowest common ancestors: simplification and parallelization. SIAM J. Comput. 17(6), 1253–1262 (1988)
23. Vitter, J.S.: External memory algorithms. In: PODS, pp. 119–128. ACM Press, New York (1998)
24. Wang, L., Xu, Y.: SEGID: identifying interesting segments in (multiple) sequence alignments. Bioinformatics 19(2), 297–298 (2003)

