Hierarchical Memory with Block Transfer
Alok Aggarwal, Ashok K. Chandra, Marc Snir
IBM Thomas J. Watson Research Center, P. O. Box 218, Yorktown Heights, New York 10598
Abstract

In this paper we introduce a model of Hierarchical Memory with Block Transfer (BT for short). It is like a random access machine, except that access to location x takes time f(x), and a block of consecutive locations can be copied from memory to memory, taking one unit of time per element after the initial access time.
We first study the model with f(x) = x^α for 0 < α < 1. A tight bound of Θ(n log log n) is shown for many simple problems: reading each input, dot product, shuffle exchange, and merging two sorted lists. The same bound holds for transposing a √n × √n matrix; we use this to compute an FFT graph in optimal Θ(n log n) time. An optimal Θ(n log n) sorting algorithm is also shown. Some additional issues considered are: maintaining data structures such as dictionaries, DAG simulation, and connections with PRAMs.
Next we study the model f(x) = x. Using techniques similar to those developed for the previous model, we show tight bounds of Θ(n log n) for the simple problems mentioned above, and provide a new technique that yields optimal lower bounds of Ω(n log² n) for sorting, computing an FFT graph, and for matrix transposition. We also obtain optimal bounds for the model f(x) = x^α with α > 1.
Finally, we study the model f(x) = log x and obtain optimal bounds of Θ(n log* n) for the simple problems mentioned above and of Θ(n log n) for sorting, computing an FFT graph, and for some permutations.
1. INTRODUCTION
1.1 Background
Large computers usually have a complex memory hierarchy consisting of a small amount of fast memory (registers) followed by increasingly larger amounts of slower memory, which may include one or two levels of cache, main memory, extended store, drums, disks, and mass store. Efficient execution of algorithms in
such an environment requires some care in making sure the data
are available in fast memory most of the time when they are
needed. Compilers, machine architectures, and operating systems
attempt to help by doing register allocation, cache management,
or demand paging, but ultimately the algorithm designer can influence the performance in major ways. For example, in multiplying two large matrices A × B by the standard (row times column) dot product algorithm, one normally gets many page faults and cache misses unless one first transposes one of the matrices so that the elements in the rows of A are contiguous, as are those in the columns of B.

0272-5428/87/0000/0204 $01.00 © 1987 IEEE
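To make the locality effect concrete, here is a small illustration in Python (our code, not from the paper): the transposed variant computes the same product while scanning only contiguous rows, which is exactly the access pattern that block transfer rewards.

```python
# Illustration (our code, not the paper's): multiplying A x B by row-times-
# column dot products reads B column-wise; transposing B once makes every
# inner loop scan contiguous memory, the pattern block transfer rewards.

def matmul_naive(A, B):
    """C[i][j] = sum_k A[i][k] * B[k][j]; B is read column-wise."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def matmul_transposed(A, B):
    """Transpose B first; every dot product then scans two contiguous rows."""
    BT = [list(col) for col in zip(*B)]  # BT[j][k] == B[k][j]
    return [[sum(a * b for a, b in zip(Arow, Bcol)) for Bcol in BT]
            for Arow in A]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert matmul_naive(A, B) == matmul_transposed(A, B) == [[19, 22], [43, 50]]
```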
In general, it is important to utilize the locality of reference in a
problem. This comes in two flavors: temporal and spatial.
Temporal locality refers to the use of the same data several times
once brought into fast memory. Spatial locality refers to using
some data followed by use of neighboring ones. This is important
because even in slow memory, the time to access the first word may
be long, but then several words can be transferred into fast memory
rapidly. Most systems use this to transfer blocks of data, e.g. lines
of cache, or pages of memory.
There has been a great deal of research devoted to studying pragmatic issues in memory hierarchies [De70, Ba80, Sm86, Si83, MGS70, AC86, G74], but relatively little aimed at a basic understanding of algorithms or data structures. Examples include [K73, MC69, FP79, F72, HK81, W83, AV87] where I/O complexity is
considered in a 2-level memory hierarchy. Part of the reason for
the paucity of theoretical work on general hierarchies may be that
no clean models have existed for this purpose. In [AACS87] the
Hierarchical Memory Model (HMM) was proposed. This is like a
Random Access Machine (RAM) model [AHU74], except that
access to location x takes time f(x). For the standard RAM model, f(x) = 1, but in memory hierarchies f(x) = log x and f(x) = x^α are more natural, and were examined. In HMM, however, there was
no concept of block transfer to utilize spatial locality of reference
in algorithms. (The log-cost RAM [AHU74] is a related model where access time is logarithmic in the value stored, not in the location -- see also [S84].) As we shall see, the capability of block
transfer results in a major change in the style and performance of
algorithms. Indeed, fairly efficient algorithms are possible even
when most of the memory is rather slow.
1.2 The Model
In this paper we propose a model, BT, of Hierarchical Memory with Block Transfer. It is like a RAM with memory locations 1, 2, 3, .... Access to location x takes time f(x). In addition, a contiguous block can be copied in unit time per word after the startup time. Specifically, a block copy operation [x − l, x] → [y − l, y] is allowed. It copies the contents of location x − i into location y − i, for 0 ≤ i ≤ l (and is valid if the two intervals [x − l, x] and [y − l, y] are disjoint). Its running time is f(x) + l if x > y, and is f(y) + l otherwise. Standard RAM operations (including indirect addressing) are also allowed -- such operations take unit time in addition to the memory reference times. For example, adding the contents of locations x and y and storing the result in z takes time 1 + f(x) + f(y) + f(z). The precise operations allowed on data will be specified in the problems considered, and will always be finite.
BT_f without block copy is the Hierarchical Memory Model HMM_f [AACS87], and with f(x) ≡ 1 it is the unit-cost RAM. In the following, we will use "block move" or "block transfer" as synonyms for "block copy." The functions f that are of particular interest include f(x) = ⌈log x⌉ and f(x) = ⌈x^α⌉ (all logarithms in this paper are base 2). We will write BT_{log x} and BT_{x^α} for these specific cases and abbreviate these as BT_log and BT_α, resp. (to avoid confusion with the function f(x) ≡ 1, we will not use the abbreviation for α = 1). The case f(x) = ⌈log x⌉ is suggested for semiconductor memory [MC80] in view of the number of levels in a hierarchical layout and in the decode logic. Also, current computers have access times for semiconductor memories which may vary from about 10 ns to a few hundred nanoseconds for access up to, say, 2^26 words of memory. The case f(x) = x^α seems more applicable when extending this model to drums, disks, and mass store, where access times may be tens of milliseconds for 2^34 words, and seconds for 2^38 words.
The BT model is fairly clean and robust. For example, one could change the time for a block copy operation [x − l, x] → [y − l, y] to f(x) + f(y) + l or (f(x) + l) + (f(y) + l), and the running time would change by at most a constant factor.
Also, there is no concept of predetermined block lengths or
boundaries. It may be noted that the program itself could be stored
at the top of the memory (low numbered locations) without increasing the running time appreciably for programs of constant
size. We will therefore not concern ourselves with this issue in the
paper. Finally, it can also be shown that one can double the space
available to an algorithm, and run two algorithms in the same
memory, while multiplying the total running time by at most a
constant.
In general, if T and S denote the time and space taken by an algorithm on a RAM, then this algorithm can be executed in O(T × f(S)) time and O(S) space. Time on the BT_f model is also bounded from above by time on the HMM_f model, and from below by time on a RAM.
1.3 The Rest of the Paper
In section 2, we study the BT_α model, for 0 < α < 1. It is shown that even to read an input of length n takes time Θ(n log log n), and this also suffices for various simple problems such as computing the dot product, performing a shuffle-exchange permutation, and merging two lists. As such, n log log n seems to correspond to "linear time" in this model; from this follow obvious O(n log n log log n) upper bounds for computing an FFT graph and for sorting. In section 3, we improve both these bounds to O(n log n).
In section 4, we study some data structures in the BT_α model, and also consider the problems of simulating straightline RAM algorithms, and PRAMs.
In section 5, we consider the BT model with other access time functions. For the BT_{x^α} model with α ≥ 1, the analog of "linear time" becomes Θ(n log n) for α = 1 and Θ(n^α) for α > 1. We also obtain optimal bounds for matrix transposition, computing an FFT graph, and for sorting in the BT_{x^α} model with α ≥ 1, and we provide an interesting technique that yields lower bounds for these problems when α = 1. Finally, in section 5, we also study BT_log and obtain optimal bounds for the above mentioned problems.
Section 6 is devoted to conclusions and possibilities for future
research.
2. BOUNDS FOR SIMPLE PROBLEMS IN BT_α (0 < α < 1)
Consider the following problem of reading the input, which we call the Touch Problem: n inputs a_1, ..., a_n are stored in memory locations 1, ..., n. An algorithm touches the input a_i if, at some time during the execution of this algorithm, a_i is moved to location 1 in memory. The problem is to touch all inputs. Remark: Without loss of generality, for various algorithms, we may assume that access to any location for an operation other than block copy is preceded by copying that value into location 1 and hence "touching" it. This increases the running time by at most a constant factor.
Theorem 2.1: Let 0 < c ≤ 1 be a constant. Any algorithm that touches cn inputs requires Ω(n log log n) time on the BT_α model where 0 < α < 1.
Proof: From the remark above, it suffices to prove this Theorem when only block copy operations are permitted. We say that input a_i is k-touched at step t if there is some memory location j such that a_i is stored in j and j ≤ k. Let b_i(t) be the least k such that a_i has been k-touched in one of the first t steps of the computation; define the potential at step t to be

Φ(t) = Σ_{i=1}^{n} log^(2) b_i(t),

where log^(2) i = max(0, log log i). The initial potential is Φ(0) = Σ_{i=1}^{n} log^(2) i. When the algorithm terminates at step T, at least cn inputs have been 1-touched; hence the final potential is bounded by

Φ(T) ≤ Σ_{i=cn}^{n} log^(2) i.

Hence

Φ(0) − Φ(T) ≥ Σ_{i=1}^{cn} log^(2) i = Ω(n log log n).

We shall prove the Theorem by showing that the decrease in potential due to a block copy is bounded by a constant times the block copy time.
Consider an operation that copies a block from locations p − l, ..., p to locations q − l, ..., q. This move may decrease the value of b_i if q < p, and then by at most log^(2) j − log^(2)(j − p + q), where a_i is stored in location j. Since l < q, it follows that the decrease in potential due to this move is at most

Σ_{j=p−l}^{p} ( log^(2) j − log^(2)(j − p + q)) ≤ (l + 1) log^(2) p − Σ_{j=0}^{l} log^(2) j.

We conclude the proof by showing that this decrease is no more than a constant times the block copy time, i.e., it is O(l + p^α). Without loss of generality, we may assume that log^(2) p > 0:

(l + 1) log^(2) p − Σ_{j=0}^{l} log^(2) j
  ≤ Σ_{j=0}^{(p^α / log^(2) p) − 1} ( log^(2) p − log^(2) j) + Σ_{j = p^α / log^(2) p}^{l} ( log^(2) p − log^(2) j)
  ≤ p^α + l( log(1/α) + 1).
■

From Theorem 2.1, a similar Ω(n log log n) bound follows if an algorithm separates cn inputs from their predecessors. (An input a_i is separated from its predecessor a_{i−1} if at some point in the algorithm, a_i is stored in location j and a_{i−1} is not in location j − 1.) This is because any such algorithm can be modified by making it use only locations that are numbered at least 2, and then preceding any block copy operation [x − l, x] → [y − l, y] by [x, x] → [1, 1] and [y + 1, y + 1] → [1, 1]. This causes the algorithm to take no more running time (except for a constant factor) and ensures that any input that is separated from its predecessor is also "touched."
Theorem 2.2: Any of the following computations can be performed in time O(n log log n) in the BT_α model for fixed 0 < α < 1.
(i) Touch problem.
(ii) Computing the dot product of two n-vectors.
(iii) Deterministic context-free language recognition.
(iv) Merging two sorted lists of size n each.
(v) Performing a shuffle permutation on n elements.
(vi) Odd-even exchange permutation on n elements.
Furthermore, problems (i), (ii), (iv), (v), and (vi) require Ω(n log log n) time, and for problem (iii), there are even some regular sets whose recognition takes Ω(n log log n) time.
Proof Outline: The lower bounds follow directly when Theorem 2.1 is applied to problems (i), (ii), (iv), (v), and (vi). Also, recognizing whether a string of length n is in the regular language 0*, i.e., whether a {0,1} string contains only zeroes, requires touching all inputs, so that the lower bound corresponding to problem (iii) follows. Below, we outline sketches of the upper bound proofs.
(i) The n inputs are stored in locations 1, 2, ..., n. Treat the input as a sequence of roughly n^{1−α} blocks of length b = ⌈n^α⌉ each. Move a block into locations 1, 2, ..., b (unless it is already there) and solve it recursively. Continue until all blocks have been processed. The running time fulfills the recursion T(n) = n^{1−α} T(n^α) + O(n), which solves to T(n) = O(n log log n) for any constant α < 1. Note that the algorithm processes the items in the order they occur in memory. The same idea applies to (ii) and (vi).
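The recursive block-processing scheme for (i) can be sketched as follows (our Python; `touch_cost` is a hypothetical cost estimator, not the paper's notation, charging f(n) + l for each block copy under f(x) = x^α):

```python
# Our sketch of the recursive touch algorithm: split the n inputs into
# ~n^(1-a) blocks of length ~n^a, copy each block to the front of memory
# at cost f(n) + length <= n^a + b, and recurse on the block.

def touch_cost(n, a):
    """Estimated cost of touching n contiguous items under f(x) = x**a."""
    if n <= 2:
        return n
    b = max(2, int(round(n ** a)))      # block length ~ n^a
    blocks = -(-n // b)                 # ~ n^(1-a) blocks (ceiling division)
    per_block_copy = n ** a + b         # f(n) + l for the copy to the front
    return blocks * (per_block_copy + touch_cost(b, a))

for n in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    # grows like n log log n: the cost per item stays a small constant
    assert touch_cost(n, 0.5) < 50 * n
```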
(iii, iv, v) It can be shown that a stack (or even a deque) can be maintained in a BT_α machine at amortized cost O( log log n) per operation, for n (push, pop, top, empty) operations, and that a constant number of such structures can be maintained simultaneously. Hence, n steps of a deterministic pushdown automaton can be simulated in time O(n log log n), which implies (iii). Two lists can be merged using three stacks in O(n) stack operations; a shuffle can be performed by executing a merge of the bottom half and the top half of the list of items, with a suitable definition of comparison order.
■

3. MATRIX TRANSPOSITION, COMPUTING FFT GRAPHS, AND SORTING FOR BT_α, 0 < α < 1
Computational problems can often be represented as directed acyclic graphs whose nodes correspond to the computed values and whose arcs correspond to the dependency relations.
Here, the input nodes (i. e., the nodes with in-degree zero) are
provided to the algorithm with given values and a node can only
be computed if all its predecessor nodes have been computed. The
time taken to compute a dag is the time taken by the algorithm to
compute all its output nodes (i. e., the nodes with out-degree zero).
The FFT (Fast Fourier Transform) graph is one such directed
acyclic graph that is quite useful, and several problems can be
solved by using an algorithm for computing the FFT graph. An
FFT graph consists of log n stages where all items are accessed in
each stage. This might seem to imply that Ω(n log n log log n) time is required to compute it. However, we obtain a tight bound of Θ(n log n). The lower bound follows directly since this graph contains Θ(n log n) nodes, and the upper bound is obtained by providing a fast algorithm for transposing a √n × √n matrix in time O(n log log n). Because of Theorem 2.2, the algorithm for transposing a matrix is optimal within a constant factor.
For 0 < α < 1/2, a √n × √n matrix can be transposed in optimal O(n log log n) time by recursively transposing submatrices of size n^α × n^α, first moving each submatrix into faster memory. We show that O(n log log n) time can be achieved even for 1/2 ≤ α < 1. In fact, we define a class of permutations which we call rational permutations and show that any rational permutation can be achieved in O(n log log n) time. A permutation is rational if it is defined by a permutation on the bits of the binary address of the items. Let [a_{m−1}, a_{m−2}, ..., a_0] be the number represented in binary notation by a_{m−1}, a_{m−2}, ..., a_0, i.e., let

[a_{m−1}, a_{m−2}, ..., a_0] = Σ_{i=0}^{m−1} a_i 2^i.

Let σ be a permutation on {m − 1, ..., 0}. The rational permutation τ_σ on n = 2^m elements associated with σ permutes the element [a_{m−1}, ..., a_0] to [a_{σ(m−1)}, ..., a_{σ(0)}]. It is easy to see that a shuffle-exchange permutation, matrix transposition, and partitioning a matrix into smaller submatrices are some examples of rational permutations for appropriate values of n.
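For concreteness, this definition can be implemented directly (our Python; `rational_permute` is our name, not the paper's): the element at address [a_{m−1}, ..., a_0] moves to the address whose bit i is a_{σ(i)}.

```python
# Our direct implementation of the definition of a rational permutation:
# bit i of the new address of an element is bit sigma(i) of its old address.

def rational_permute(data, sigma):
    """Apply tau_sigma to a list of length 2^m; sigma permutes bit positions."""
    m = len(data).bit_length() - 1
    assert len(data) == 1 << m and len(sigma) == m
    out = [None] * len(data)
    for addr, v in enumerate(data):
        new_addr = 0
        for i in range(m):
            if (addr >> sigma[i]) & 1:   # bit i of the new address is a_sigma(i)
                new_addr |= 1 << i
        out[new_addr] = v
    return out

# Swapping the halves of the address bits transposes a row-major matrix:
assert rational_permute([0, 1, 2, 3], [1, 0]) == [0, 2, 1, 3]          # 2 x 2
assert (rational_permute(list(range(16)), [2, 3, 0, 1])                # 4 x 4
        == [(i % 4) * 4 + i // 4 for i in range(16)])
```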
Theorem 3.1: Any rational permutation on a set of elements that are stored in the first O(n) locations in the memory of a BT_α machine can be achieved in O(n log log n) time for fixed 0 < α < 1.
Proof Outline: Let m = log n. We define a k-input-block to consist of 2^k inputs, initially stored in locations r2^k + 1, ..., (r + 1)2^k for some r; these are the items with the same m − k most significant bits. Similarly, a k-output-block consists of 2^k items that occupy locations r2^k + 1, ..., (r + 1)2^k when the permutation has been computed. The proof of Theorem 3.1 relies on the following observations:
• A rational permutation τ_σ where σ(i) = i for i < αm (i.e., the corresponding bit permutation involves permuting only the most significant m − ⌈αm⌉ bits) can be achieved in O(n) time. This is because such a permutation preserves ⌈αm⌉-input-blocks; it can be achieved by moving each ⌈αm⌉-input-block (of size approximately n^α). The cost of each move is O(n^α), i.e., constant time per element.
• Any rational permutation τ_σ where σ(i) = i unless i ∈ {k − 1, k − 2, ..., k − ⌊0.25(1 − α)m⌋}, for any k ≥ 0.25m, can be achieved in O(n) time. This follows because we can move each αm-input-block from the first O(n) memory locations to the first O(n^α) memory locations at cost O(n^α) per block move; we can move each α²m-input-block from the first O(n^α) memory locations to the first O(n^{α²}) memory locations at cost n^{α²} per block move, and so on. Each move takes constant time per element. In 2/ log(1/α) iterations, k-input-blocks are moved to the first O(2^k) memory locations. We can then permute the data within these blocks by applying the previous observation.
• Any rational permutation τ_σ where σ(i) = i unless i ≥ m/4 can be achieved in O(n) time; this is accomplished as follows. Partition the most significant ⌈0.75m⌉ bits into intervals of size ⌊0.125(1 − α)m⌋ each and denote these intervals by A_p, A_{p−1}, ..., A_1, where p = 6/(1 − α). It is not hard to see that any permutation on the set {m − 1, m − 2, ..., ⌈0.25m⌉} can be expressed as the product of O(p²) permutations, where each of these permutations affects at most two consecutive intervals (A_i, A_{i−1}). Since each pair of intervals has at most 0.25(1 − α)m bits, the corresponding rational permutation on n elements can be achieved in O(n) time by using the previous two observations. Consequently, a rational permutation in which the corresponding bit permutation permutes only the most significant 0.75 log n bits can be achieved in O(p²n) time, i.e., in O(n) time since p is a constant that depends only upon α.
Below, we use the divide-and-conquer technique and the previous observations to achieve a rational permutation on n elements. We assume that the input is in locations 2n + 1, 2n + 2, ..., 3n and the output will be in 3n + 1, 3n + 2, ..., 4n.
Let σ be a permutation on {m − 1, ..., 0}. It is not hard to see that σ can be expressed as the product of three permutations σ_1 σ_2 σ_3, where σ_1 and σ_3 are permutations on the most significant ⌈0.75m⌉ bits and σ_2 is a permutation on the least significant ⌊0.5m⌋ bits. Now, τ_σ is also the product τ_{σ_1} τ_{σ_2} τ_{σ_3}. Furthermore, τ_{σ_1} and τ_{σ_3} can be computed in O(n) time using the above observations, and τ_{σ_2} is computed recursively by first moving appropriate k = 2^{⌊m/2⌋} contiguous inputs (i.e., ⌊m/2⌋-input-blocks) into locations 2k + 1, ..., 3k and then applying τ_{σ_2} to them.
The running time of this algorithm fulfills the recursion T(n) = √n × T(√n) + O(n), where T(1) = constant. Thus, it can be verified that T(n) = O(n log log n).
■

If n is any power of 4, then a √n × √n matrix can be transposed using a rational permutation. In fact, if p and q are both powers of two, then a p × q matrix can be transposed using a suitable rational permutation; below, we consider the general case of transposing any p × q matrix.
Corollary 3.2: A p × q matrix can be transposed on a BT_α machine with 0 < α < 1, in time O(pq log log pq).
Proof: Let r = 2^⌈log p⌉, s = 2^⌈log q⌉, and let A be the given p × q matrix. Then, perform the following steps:
(a) Expand A to an r × s matrix B such that B(i, j) = A(i, j) for 1 ≤ i ≤ p, 1 ≤ j ≤ q, and B(i, j) is arbitrary otherwise.
(b) Transpose B to obtain B^T.
(c) Extract A^T from B^T.
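Steps (a)-(c) can be sketched directly (our Python; the paper of course realizes them with block copies and the rational permutation of Theorem 3.1, which is what the rest of the proof addresses):

```python
# Pad to power-of-two dimensions, transpose, extract (our sketch of (a)-(c)).

def transpose_via_padding(A, p, q):
    r = 1 << (p - 1).bit_length()   # r = 2^ceil(log p)
    s = 1 << (q - 1).bit_length()   # s = 2^ceil(log q)
    # (a) expand A to an r x s matrix B; padding entries are arbitrary (0 here)
    B = [[A[i][j] if i < p and j < q else 0 for j in range(s)] for i in range(r)]
    # (b) transpose B -- a rational permutation, since r and s are powers of two
    BT = [[B[i][j] for i in range(r)] for j in range(s)]
    # (c) extract the q x p matrix A^T from B^T
    return [row[:p] for row in BT[:q]]

assert transpose_via_padding([[1, 2, 3], [4, 5, 6]], 2, 3) == [[1, 4], [2, 5], [3, 6]]
```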
Clearly, step (b) is a rational permutation and can be computed in O(rs log log rs) time using Theorem 3.1. Since the computation of step (c) is similar to that of step (a), we only show below how to compute step (a).
Initially, the p × q matrix A is in locations 2rq + 1, ..., 2rq + pq. For any q_1 ≥ 1, to recursively expand a p × q_1 matrix A_1 that is stored in locations 2rq_1 + 1, ..., 2rq_1 + pq_1 in row major order, let q_2 = max(1, ⌊(rq_1)^α / r⌋) and partition A_1 into ⌈q_1/q_2⌉ submatrices, each (except possibly the last submatrix) of size p × q_2; move every submatrix into locations 2rq_2 + 1, ..., 2rq_2 + pq_2; recursively expand that submatrix, and copy the result into appropriate locations in 3rq_1 + 1, ..., 4rq_1 to obtain the r × q_1 expansion of A_1 (the case q_1 = 1 and the last submatrix of A_1 are special cases that are easily handled). The running time is O(rs log log rs), since there are at most ⌈ log log q / log(1/α)⌉ recursive levels, each of which takes a cumulative running time of O(rs).
■

Theorem 3.3: An n-point FFT graph can be computed in O(n log n) time on a BT_α machine for fixed 0 < α < 1.
Proof Outline: An n-point FFT graph can be computed by recursively computing 2√n FFT graphs on √n points each and two transpositions of matrices of size √n × √n. Now, we can use the transposition algorithm given in Corollary 3.2 that takes O(n log log n) time in the worst case, so, if T(n) denotes the time to compute this FFT graph, we have T(n) ≤ 2√n × T(√n) + O(n log log n) and T(1) = constant. This yields T(n) = O(n log n).
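The recurrence can be checked numerically (our sketch; the constant c = 1 and the base cases are arbitrary choices): T(n) ≤ 2√n T(√n) + cn log log n indeed stays within a constant factor of n log n.

```python
import math

# Numerically checking the recurrence of Theorem 3.3 (our sketch):
# T(n) <= 2*sqrt(n)*T(sqrt(n)) + c*n*log log n, with constant base cases,
# should stay within a constant factor of n log n.

def T(n, c=1.0):
    if n <= 4:
        return float(n)
    r = math.isqrt(n)
    return 2 * r * T(r, c) + c * n * math.log2(max(2.0, math.log2(n)))

for k in (16, 256, 65536):                 # powers of two with integer roots
    assert T(k) / (k * math.log2(k)) < 4   # ratio stays bounded as n grows
```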
■

We can sort a set S of n words in O(n log n log log n) time and O(n) memory using simple merge sort and the fact that merging can be done in O(n log log n) time on a BT_α machine (cf. Theorem 2.2 (iv)). In the following, we demonstrate an optimal O(n log n) time algorithm, Approx-Median-Sort(S), that takes O(n log log n) space.
Approx-Median-Sort(S) assumes that the elements of set S = {s_1, ..., s_n} reside in locations 2n + 1, ..., 3n, and it consists of the following five steps:
Step 1: Partition S into n^{1−α} subsets, S_1, ..., S_{n^{1−α}}, of size n^α each. For 1 ≤ j ≤ n^{1−α}, execute the following substeps:
Substep 1.1: Bring the elements into locations indexed 2n^α + 1, ..., 3n^α and sort S_j recursively by calling Approx-Median-Sort(S_j).
Substep 1.2: Form a set A containing the (i × log n)-th element of S_j for 1 ≤ i ≤ n^α / log n.
Step 2: After the completion of substep (1.2), A has n / log n elements. Furthermore, a simple analysis shows that the p-th smallest element in A is greater than at least p × log n − 1 elements in S, and this element is also greater than at most p × log n + (n^{1−α} − 1) × log n − 1 elements in S. Now, sort A by using merge sort.
Step 3: Form a set B = {b_1, ..., b_r} of r = n^α / log n approximate-partitioning elements, such that for 1 ≤ l ≤ r, b_l equals the (l × n^{1−α})-th smallest element of A. At this point, note that because of the remark made in step (2), there are at least l × n^{1−α} × log n − 1 elements of S that are less than b_l, and also at most (l + 1) × n^{1−α} × log n − 1 elements of S that are less than b_l.
Step 4: Now, for all j, 1 ≤ j ≤ n^{1−α}, partition the elements of S_j into r + 1 subsets, S_{j,0}, S_{j,1}, ..., S_{j,r}, such that for every x ∈ S_{j,l}, b_l ≤ x < b_{l+1} (treating b_0 as −∞ and b_{r+1} as +∞). Compute C_l = ∪_{j=1}^{n^{1−α}} S_{j,l} and let |C_l| = n_l. Then, from the remarks in step (3), note that 1 ≤ n_l ≤ 2 × n^{1−α} × log n.
Step 5: For 0 ≤ l ≤ n^α / log n, bring C_l to the faster memory and sort C_l by calling Approx-Median-Sort(C_l) recursively.
end
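Ignoring the memory layout, the partitioning logic of the five steps can be sketched as follows (our Python; the base-case cutoff and rounding choices are arbitrary, and Python's built-in sort stands in for the constant-size base cases):

```python
import bisect
import math

# Our non-hierarchical sketch of Approx-Median-Sort: the block moves are
# ignored; what remains is the sample / partition / recurse structure.

def approx_median_sort(S, a=0.5):
    n = len(S)
    if n <= 16:
        return sorted(S)
    na = max(2, int(round(n ** a)))          # subset size ~ n^a
    gap = max(1, int(round(math.log2(n))))   # sampling gap ~ log n
    # Step 1: split into ~n^(1-a) subsets, sort each recursively, sample each
    subsets = [approx_median_sort(S[i:i + na], a) for i in range(0, n, na)]
    A = sorted(x for Sj in subsets for x in Sj[gap - 1::gap])  # step 2
    # Step 3: approximate-partitioning elements: every ~n^(1-a)-th sample
    step = max(1, len(subsets))
    B = A[step - 1::step]
    # Step 4: bucket each element between consecutive partitioning elements
    buckets = [[] for _ in range(len(B) + 1)]
    for x in S:
        buckets[bisect.bisect_right(B, x)].append(x)
    # Step 5: each bucket is small by the step-3 guarantees; recurse on it
    return [x for C in buckets
            for x in (approx_median_sort(C, a) if len(C) < n else sorted(C))]

data = [(37 * i) % 101 for i in range(500)]
assert approx_median_sort(data) == sorted(data)
```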
Theorem 3.4: Algorithm Approx-Median-Sort(S) sorts n words in O(n log n) time on a BT_α machine for 0 < α < 1.
Proof Outline: The correctness of Approx-Median-Sort(S) is easy to establish and hence omitted. To analyze the running time, let T(n) denote the time taken by Approx-Median-Sort(S) to sort n elements. Then, substep (1.1) can be executed in total time n^{1−α} T(n^α) + O(n) for all S_j. Substep (1.2) can be performed by transposing an (n^α / log n) × n^{1−α} log n matrix in time O(n log log n). Merge sort for m = n / log n items in step (2) takes time O(m log m log log m) = O(n log log n). The set B is computed in step (3) from set A in time O((n log log n) / log n). Each set S_j can be brought into fast memory and merged with the set B in total time O(n^{1−α} × (n^α log log n^α)) = O(n log log n). At that point the sets S_{j,l} have been computed; they are stored in "row major order," i.e., the sets S_{j,0}, ..., S_{j,r} are stored in contiguous memory locations for each j. We need to bring together the sets S_{1,l}, ..., S_{n^{1−α},l} for each 0 ≤ l ≤ r, i.e., store the matrix of sets S_{j,l} in "column major order." From Theorem 3.5, this permutation can be computed in O(n( log log n)^4) time and O(n log log n) space. Finally, step (5) can be executed in O(n log log n) + Σ_l T(n_l) time, where Σ_l n_l = n and n_l ≤ 2n^{1−α} log n. This implies that the running time of this algorithm obeys the following recurrence relation:

T(n) ≤ n^{1−α} T(n^α) + O(n( log log n)^4) + Σ_{l=0}^{n^α / log n} T(n_l),

where T(2) = constant, Σ_{l=0}^{n^α / log n} n_l = n, and n_l ≤ 2n^{1−α} log n. It can then be verified that T(n) = O(n log n).
■

Suppose the (i, j)-th entry of a p × q matrix, M, consists of a set R(i, j) of data elements such that |R(i, j)| ≥ 1 and Σ_{i=1}^{p} Σ_{j=1}^{q} |R(i, j)| = n, the entries of the matrix are stored in row major order, and the data elements of R(i, j) are stored in contiguous memory locations. Then, the problem of Generalized Matrix Transposition requires the entries R(i, j) to be stored in column-major order such that the data elements of each individual R(i, j) are still located in contiguous memory locations.
The sets C_l, for all values of l between 1 and r, in step (4) of Approx-Median-Sort(S) can be computed using a generalized matrix transposition for a matrix of size (n^α / log n) × n^{1−α}.
Theorem 3.5: The generalized matrix transposition problem can be solved in O(n( log log n)^4) time and O(n log log n) space on a BT_α machine, for any fixed α where 0 < α < 1.
Proof Outline: The proof of Theorem 3.5 is an extension of a simple O(n( log log n)^2) time algorithm for matrix transposition -- this will be provided in the final version of the paper. Roughly, one extra factor of log log n arises in the algorithm for generalized matrix transposition because we need to determine the length of the blocks to be moved, and the other factor arises because of the more complicated storage allocation needed.
■

4. MORE RESULTS FOR THE BT_α MODEL (0 < α < 1)

4.1 Maintaining Dictionaries

As mentioned in the proof of Theorem 2.2, stacks and deques can be implemented in the BT_α model in amortized time O( log log n) per (insertion, deletion) operation. It is not hard to show that searching can be performed in optimal time O(n^α) using a perfectly balanced binary search tree that is stored in memory in breadth-first order. We now consider data structures that support all three operations.
A dictionary is a data structure that supports the following three operations: search(key), insert(key, value), and delete(key); we call the last two operations update operations. For simplicity, we assume that an insert with the key of an entry that is already present in the dictionary replaces that entry; a delete for a nonexisting key has no effect. We present an efficient dictionary structure for the BT_{x^α} model, where 0 < α < 1. This structure supports searching in time O(n^α), which is optimal; updates are done in amortized time O( log n log log n).
We may assume that a dictionary is built by a sequence of update operations, starting from an empty structure. The dictionary can be considered to consist of a sequence of entries, where each entry represents one update request. An update operation adds an entry to the head of this sequence; a search operation returns the most recent entry with a matching key value. The search fails if there is no such entry, or if the most recent such entry is a delete. The sequence of entries can be reordered and compressed -- updates on different keys commute with one another, and older updates to a key can be deleted if there is a more recent update to the same key.
Theorem 4.1: A dictionary can be maintained in the BT_α model (0 < α < 1) so that the worst-case time to search for an element in the dictionary is O(n^α) after n operations, and it takes O( log n log log n) amortized time to insert or delete an element in the dictionary.
Proof: Our construction is similar to the static-to-dynamic transformation method for data structures of Bentley and Saxe [BS80]. We store the sequence as a list T_1, ..., T_k of trees, with T_1 containing updates that are more recent than those in T_2, and so on. T_i is a complete binary search tree, containing at most 2^i − 1 entries. There are no duplicate entries for the same key in a tree; however, there may be duplicate entries in distinct trees. After n operations, k = O( log n).
Tree T_i is stored between locations c2^i and (c + 1)2^i − 1 for an appropriate c ≥ 2, with interspersed "scratch" space from (c + 1)2^i to c2^{i+1} − 1. We will think of T_i as having 2^i − 1 nodes (by padding, if necessary). T_i is an i-level complete binary tree where the key for any node is larger than the keys for those in its left subtree and smaller than those in its right subtree. We partition T_i into subtrees, each of which is an αi-level complete binary tree T_{i,u} rooted at a node u in T_i (for this exposition, we ignore the case where αi is not an integer). Each T_{i,u} is stored in contiguous locations in order (i.e., each T_{i,u} is sorted by keys), and T_{i,u} is stored before T_{i,v} if u is nearer to the root of T_i than v, or if they are at the same level and u is to the left of v. Thus, for a fixed i, the nodes in all subtrees T_{i,u} at the same level are contiguous and sorted -- call this set of nodes a slice. The three operations are performed as follows:
Search -- Search each T_i successively, until an entry with a matching key value is found.
Update -- Let T_0 be a one-node tree, representing the update request. Starting with i = 0, recursively perform the following: merge T_i with T_{i+1}, and store the result in T_{i+1} (T_i remains empty); if T_{i+1} overflows, then recursively merge it with T_{i+2}.
Tree T_i can be searched in time O(2^{αi}) as follows. First make space in fast memory by copying the contents of locations 1, 2, ..., 2^{αi} − 1 in time O(2^{αi}) into scratch storage starting at location (c + 1) × 2^i. Then, the first 2^{αi} − 1 locations in T_i (i.e., T_i's root subtree) are copied into locations 1, ..., 2^{αi} in time O(2^{αi}) and searched in O(αi × 2^{α²i}) time using αi probes, each taking O(2^{α²i}) time. This determines which tree T_{i,u} is to be searched next (unless the key has already been found), and the process is repeated at most ⌈1/α⌉ times, completing the search of T_i. Finally, we restore the original contents of locations 1, ..., 2^{αi} − 1. Thus, the total time for searching all of T_0, ..., T_{log n} is O(n^α).
Now, consider a sequence, of n updates being executed on a
dictionary that contains n valid entries. We claim that two trees
of size O(m) can be m~rged in time O(m log log m); the cost per
entry of a merge is O( log log m). Every time 1'; is merged with
1';+1' each entry in 1'; is moved to 1';+1' and this may cause oneentry to disappear (if it has the same key). It follows that an entry
is merged at most k = O( log n) times. The total cost of all themerges per entry is O( log n log log n).
We conclude the analysis by proving our claim on merge time. To merge T_i and T_{i+1}, first sort T_i, T_{i+1}; next, merge the sorted lists; and finally, reconstitute T_{i+1} by essentially reversing the operations in the first step. We claim that each step takes O(m log log m) time for m = 2^i. Since this bound has been shown for merging, to prove this claim it suffices to show how to sort T_i in this time bound. But this follows easily, since the ⌈1/α⌉ slices that partition T_i are already stored as sorted lists, and hence can be merged pairwise in time O(m log log m). ∎
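The update cascade above is an instance of the logarithmic ("merge the levels") method. The following Python sketch is our own illustration, not the paper's code: Python's sort stands in for the O(m log log m) BT_α merge, and duplicate keys disappear on merging, as in the text.

```python
import bisect

class CascadeDict:
    """Level structure T_0, T_1, ... with |T_i| <= 2**i, kept as sorted lists.
    An update merges the carried entries into successive levels until one has
    room, mirroring the merge cascade described above."""

    def __init__(self):
        self.levels = []          # levels[i] plays the role of T_i

    def insert(self, key):
        carry = [key]             # a one-node tree representing the update request
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append([])
            merged = sorted(set(self.levels[i]) | set(carry))
            if len(merged) <= 2 ** i:
                self.levels[i] = merged
                return
            self.levels[i] = []   # T_i remains empty ...
            carry = merged        # ... and is recursively merged into T_{i+1}
            i += 1

    def search(self, key):
        # Search each T_i successively until a matching key is found.
        for lst in self.levels:
            j = bisect.bisect_left(lst, key)
            if j < len(lst) and lst[j] == key:
                return True
        return False
```

An entry participates in O(log n) merges overall, which is where the amortized O(log n log log n) update cost in the analysis comes from once each merge is charged O(log log m) per entry.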
It is interesting to note the similarity between the trees T_i given above and B-trees -- both can be thought of as having a large fanout and keeping siblings contiguous. However, our structure is usually more efficient for updates in BT_α, but its efficiency drops when the number of valid dictionary entries becomes much less than the number of operations n.
4.2 Computing DAGs and Simulation of Straight-line Algorithms
Let G = ⟨V, E⟩ be a directed acyclic graph. We denote by V_C the set of computation nodes of G, i.e., the set of nodes with nonzero indegree. The nodes with indegree zero are the input nodes; those with outdegree zero are the output nodes. If U is a subset of nodes, let In(U) denote the set of nodes in V - U with an edge leading into U:
In(U) = {v ∈ V - U | ({v} × U) ∩ E ≠ ∅},
and let Out(U) denote the output nodes of U:
Out(U) = {v ∈ U | v ∈ V_out, or ({v} × (V - U)) ∩ E ≠ ∅}.
Define V_1, V_2 to be a cut of U if
• V_1 ∪ V_2 = U, V_1 ∩ V_2 = ∅;
• (V_2 × V_1) ∩ E = ∅.
If V_1, V_2 is a cut of U, then In(V_1) ⊆ In(U) and Out(V_2) ⊆ Out(U).
Let G be a DAG, and let U be a set of computation nodes in G. U is f(n)-separable if |U| = 1, or U has a cut V_1, V_2 such that
1. |V_1| ≤ (2/3)|U| and |V_2| ≤ (2/3)|U|;
2. |In(V_1) ∪ Out(V_1)| ≤ f(|U|) and |In(V_2) ∪ Out(V_2)| ≤ f(|U|);
3. V_1 and V_2 are recursively f(n)-separable.
G is f(n)-separable if V_C is f(n)-separable. Separation, as defined here, is closely related to separators as used for VLSI layout [L82]; we have the added restriction that all edges cut by a separator are oriented in the same direction. Let G be a dag that is f(n)-separable. We define a partition tree T for G as follows.
• The root of T is associated with V_C, the set of computation nodes of G.
• Let u be a node of T that is associated with a set V_u of nodes of G. If |V_u| = 1, then u is a leaf. Otherwise, let V_1, V_2 be a cut of V_u that fulfills conditions (1)-(3) above. Then the left child of u is associated with V_1, and the right child of u is associated with V_2.
The partition tree T has logarithmic depth; the computation nodes of G occur as leaves of T in an order that is a topological sort of G.
We use the partition tree T of G to evaluate the dag. The evaluation proceeds by traversing the tree recursively. When a leaf is traversed, the associated node is evaluated; when an internal node is traversed, data are permuted in memory. For a tree node u, let In(u) (resp., Out(u)) denote the set of values associated with the nodes in In(V_u) (resp., Out(V_u)). To simplify the presentation, we assume that for each set u in the partition, |In(u)| = |Out(u)| (this can be achieved by "padding" these sets). The algorithm fulfills the following invariants:
• When the evaluation of u begins, all values in In(u) are available in the first |In(u)| locations in memory.
• When the evaluation of u terminates, all values in Out(u) are available in the first |Out(u)| locations in memory.
The algorithm is described below.
eval(u);
{ if u is a leaf
    then evaluate u, and replace In(u) by Out(u)
  else {
    let u_l and u_r be the left and right children of u;
    using In(u), compute (In(u_l), In(u_r) ∩ In(u));          --(*)
    eval(u_l);
    using (Out(u_l), In(u_r) ∩ In(u)),
      compute (In(u_r), Out(u_l) ∩ Out(u));                   --(*)
    eval(u_r)
  }
}
It is easy to see that the algorithm preserves the two invariants mentioned above. Each of the two lines labeled (*) consists of unmerging a sequence into two sequences (the input sequence is a merge of the two output sequences with duplicate keys eliminated), possibly followed by a block exchange. Such unmerging can be done in time O(n log log n). The total running time of this algorithm fulfills the following recursion:
t(n) = t(n_1) + t(n_2) + O(f(n) log log f(n)), where t(1) = 1, n_1 + n_2 = n, and n/3 ≤ n_1, n_2 ≤ 2n/3.
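As a sanity check, the recursion can be evaluated numerically. This is a sketch under our own conventions: an even split (which satisfies the n/3 ≤ n_1, n_2 ≤ 2n/3 condition), base-2 logarithms clamped below at 1, and unit leaf cost.

```python
import math

def t_rec(n, f):
    """Evaluate t(n) = t(n1) + t(n2) + f(n) * log log f(n) with an even split
    and t(1) = 1; the iterated logs are clamped so the cost term stays >= f(n)."""
    if n <= 1:
        return 1.0
    n1 = n // 2
    ll = max(1.0, math.log2(max(2.0, math.log2(max(2.0, f(n))))))
    return t_rec(n1, f) + t_rec(n - n1, f) + f(n) * ll
```

For f(n) = n^{2/3} the value grows linearly in n, and for f(n) = n it grows like n log n log log n, in line with the two corollaries that follow.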
Note that this algorithm does not determine the partition tree T from the graph G, nor the sets In(u), Out(u) -- only the times for the data movement and computation in G are counted. We call such an algorithm a nonuniform algorithm. Below, we provide two applications of the above algorithm for computing general directed acyclic graphs.
Corollary 4.2: If a straight-line algorithm takes T(n) time on a RAM, then it can be simulated on a BT_α machine (0 < α < 1) in O(T log T log log T) time by a nonuniform algorithm.
Proof: A straight-line RAM algorithm of length T is represented by a bounded-degree dag with T nodes. Every n-node bounded-degree dag is O(n)-separable. This implies that a dag with T nodes can be evaluated in the BT_α model (0 < α < 1) in time t(T), where t(T) fulfills the recursion t(T) = t(T_1) + t(T_2) + O(T log log T), where T_1 + T_2 = T and T/3 ≤ T_1, T_2 ≤ 2T/3. This recursion yields t(T) = O(T log T log log T). ∎
The best known lower bound for DAG simulation seems to be Ω(T log n), which can be derived from Corollary 5.9.
Corollary 4.3: Two n × n matrices can be multiplied by a nonuniform algorithm in O(n³) time on a BT_α machine (0 < α < 1), using (+, ×) only.
Proof: Hong and Kung [HK81] have shown that the dag representing the classical matrix multiplication algorithm (which uses + and × only) is O(n^{2/3})-separable. Consequently, we can use the above algorithm to show that two matrices can be multiplied in the BT_α model in time t(n³), where t(n) fulfills the recursion t(n) = t(n_1) + t(n_2) + O(n^{2/3} log log n), where n_1 + n_2 = n and n/3 ≤ n_1, n_2 ≤ 2n/3. Since this recursion yields t(n) = O(n), it follows that two matrices can be multiplied in O(n³) time, which is optimal within a constant factor. ∎
The restriction of nonuniformity can be eliminated for matrix multiplication. Use a simple divide-and-conquer method: an n × n matrix product can be computed from 8 products of n/2 × n/2 matrices. Furthermore, a matrix with m elements can be broken into 4 submatrices by "unmerging" a list into four sublists in time O(m log log m). It follows that the running time of this algorithm fulfills the recursion t(n) = 8t(n/2) + O(n² log log n), so that t(n) = O(n³). The algorithm derived in Corollary 4.3 is essentially similar.
4.3 Simulation of PRAM Algorithms by a BT_α Machine
Theorem 4.4: Let t, s, and p denote, respectively, the time, memory space, and number of processors required by a Concurrent Read Concurrent Write PRAM algorithm. Then this algorithm can be simulated in time O(t × (s log log s + p log p)) on a BT_α machine (0 < α < 1).
Proof: We use a simulation method due to Awerbuch, Israeli, and Shiloach [AIS83]. The first O(p) locations in memory are used to store processor records that represent the state of each simulated processor; the next O(s) memory locations contain a list of memory records representing the contents of the PRAM memory. A PRAM step is simulated as follows.
• The next instruction of each processor is simulated; if this instruction accesses memory, then an access record containing the memory address, the processor id, and (for a write) the stored value is created. The time for this phase is O(p log log p).
• The access records are sorted according to their address in time O(p log p), and write conflicts are eliminated.
• The list of access records is "merged" with the list of memory records in time O((p + s) log log(p + s)) = O(s log log s + p log p). The merge is actually an update operation that modifies each of the two lists: if an access record represents a write, then the corresponding memory record is updated; if it is a read, then the value of the corresponding memory record is copied to the access record -- concurrent reads are also handled here.
• The access records are sorted by processor id in time O(p log p).
• The list of access records is "merged" with the list of processor records; the state of each processor that executed a read instruction is updated to contain the returned value.
The total time for the simulation of one PRAM step is O(s log log s + p log p). ∎
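The record-sorting structure of one simulated step can be rendered in Python as follows. This is an illustrative sketch, not the paper's or [AIS83]'s code; the function name, the record layout, and the lowest-id write-conflict rule are our assumptions.

```python
def simulate_pram_step(memory, instructions):
    """One CRCW PRAM step via access records.  `memory` is a dict addr -> value;
    `instructions` is a list of (pid, 'read', addr) or (pid, 'write', addr, value).
    Concurrent writes resolve by lowest pid; reads here observe this step's writes
    (one simple convention).  Returns {pid: value} for the read instructions."""
    # 1. Create one access record [pid, op, addr, val] per instruction.
    recs = [list(ins) + [None] * (4 - len(ins)) for ins in instructions]
    # 2. Sort by address (pid breaks ties) and eliminate write conflicts:
    #    only the first write per address survives.
    recs.sort(key=lambda r: (r[2], r[0]))
    kept, written = [], set()
    for pid, op, addr, val in recs:
        if op == 'write':
            if addr in written:
                continue                     # conflicting write eliminated
            written.add(addr)
        kept.append([pid, op, addr, val])
    # 3. "Merge" with the memory records: apply writes, serve (concurrent) reads.
    for rec in kept:
        pid, op, addr, val = rec
        if op == 'write':
            memory[addr] = val
        else:
            rec[3] = memory.get(addr)
    # 4. Sort back by processor id and update the processor states.
    kept.sort(key=lambda r: r[0])
    return {pid: val for pid, op, addr, val in kept if op == 'read'}
```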
Corollary 4.5: The connected components of an undirected graph G with n vertices and m edges can be obtained in O(m log² n) time on a BT_α machine (0 < α < 1).
Proof: Shiloach and Vishkin [SV82] have provided a parallel algorithm that computes the connected components of a graph with n vertices and m edges in O(log n) time and O(m) memory space, using n + m processors. ∎
5. BOUNDS FOR OTHER MODELS
5.1 Bounds for BT_x Model
Time bounds for various problems in the BT_x model (i.e., BT_α with α = 1) are listed in the table given below. The Θ(n log n) bounds for simple problems such as merging, the shuffle-exchange permutation, and the touch problem are proven using the same techniques as in Theorems 2.1 and 2.2; the O(n log² n) upper bounds for matrix transposition and computing FFT graphs can be obtained by using log n stages of the shuffle-exchange permutation, and that for sorting is obtained by using a simple merge-sort algorithm. The dictionary algorithm is similar to that given in Theorem 4.1. However, unlike the upper bounds, the lower bounds for computing FFT graphs, matrix transposition, and sorting require a new technique, which is described below.
A computation is conservative if the only operations used are block moves. We have the following lower bound for conservative permutation algorithms.
Theorem 5.1: The average number of steps required to perform a randomly chosen permutation on n items using a conservative computation on a BT_x machine is Ω(n log² n).
Proof Outline: It is useful to define models of a two-level memory system (denoted by L(m)) and of a specially-blocked BT_x machine, and to prove Lemmas 5.2 and 5.3 for these models.
We define a two-level memory system, L(m), as one in which there are two memories -- a primary memory that consists of m words (e.g., a cache) and a secondary memory that is potentially infinite in size. The processor is connected to the primary memory. Any set of b ≤ m/2 data items that are present in contiguous locations of the secondary memory can be transferred to any b locations in the primary memory in one transfer step. Conversely, the contents of any b locations in primary memory can be copied into contiguous locations in the secondary memory. In this model, it is assumed that the processor can perform any simple operation, like reading from (or writing into) primary memory, or comparing two words in the primary memory. Here, we are only interested in the number of transfer steps that are required for solving a given problem in this model, and since Aggarwal and Vitter [AV87] have considered this model in some detail, we quote the following result from their paper:
Lemma 5.2 [AV87]: The average number of transfer steps needed to perform a random permutation on n elements using a conservative computation on an L(m) machine is Ω(min(n, (n/m) × log(n/m))).
Proof: The proof of Lemma 5.2 can be found in [AV87]. ∎
A specially-blocked BT_x machine is one in which the memory is partitioned into sets of contiguous locations such that the i-th set, S_i, contains the 2^i memory locations [2^i, 2^{i+1} - 1], and any algorithm for this machine transfers a block of contiguous data items only from set S_i to set S_{i-1}, or from S_{i-1} to S_i, for i ≥ 1.
Lemma 5.3: If a conservative algorithm runs on a BT_x machine in time T, then this algorithm can be modified into a conservative algorithm for a specially-blocked BT_x machine that runs in time O(T).
Proof: Partition the memory of a BT_x machine M' into sets of contiguous locations like the specially-blocked machine, i.e., the i-th set, S'_i, has size 2^i. The simulating specially-blocked machine M stores the contents of S'_i in the lower part of S_{i+2}; the upper part is used as a buffer. Consider a block move [x - l, x] → [y - l, y] in M', where x > y (the case x < y is treated symmetrically). The cost of this move is at least x. This move is simulated by M as follows: Let j = ⌊log(x - l)⌋. Then the block is contained in S'_j ∪ S'_{j+1}, and hence in S_{j+2} ∪ S_{j+3}. If part of it is in S_{j+3}, then move it into the buffer of S_{j+2} to obtain a contiguous block B in M to be moved. If part of B has to go to S_{j+2} (i.e., to S'_j), copy this piece directly. Copy the rest of B into the buffer of S_{j+1}. If part of this has to go to S_{j+1}, copy it, and move the rest to the buffer of S_j, and so on. The total time for all these moves is bounded by
Σ_{i=0}^{j+3} 2^{i+1} < 2^{j+5} = O(x). ∎
Proof Outline (of Theorem 5.1, continued): Now, for 0 ≤ i ≤ log n - 2, consider a two-level memory system L(m_i) such that m_i = 2^{i+1} - 1, and let T_i(m_i) denote the expected number of transfer steps that are required to achieve a permutation of n words on L(m_i). Using Lemmas 5.2 and 5.3, it follows that the expected time taken by any conservative algorithm on a BT_x machine is at least
Ω(Σ_{i=0}^{log n - 2} 2^i × T_i(m_i)).
Consequently, the expected time taken to permute n words on a BT_x machine is at least
c × Σ_{i=0}^{log n - 2} 2^i × min(n, (n/2^i) log(n/2^i)),
which is Ω(n log² n). ∎
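The final sum can be checked numerically (our own sketch: base-2 logs, n a power of two); the computed value hovers around n log² n / 3.

```python
import math

def conservative_lb(n):
    """Evaluate  sum_{i=0}^{log n - 2} 2^i * min(n, (n/2^i) * log2(n/2^i))
    from the proof above, for n a power of two."""
    L = int(math.log2(n))
    total = 0.0
    for i in range(L - 1):               # i = 0 .. log n - 2
        total += (2 ** i) * min(n, (n / 2 ** i) * math.log2(n / 2 ** i))
    return total
```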
Theorem 5.4: Any algorithm that sorts n words (using only comparisons and block moves), computes an FFT graph, or transposes a √n × √n matrix takes Ω(n log² n) time on a BT_x machine.
Proof Outline: A sorting algorithm can be used to perform an arbitrary permutation by suitably fixing the outcomes of the comparisons; three FFT graphs can be cascaded to obtain a permutation network. Thus, the lower bound for these two problems follows from Theorem 5.1. The lower bound for transposing a √n × √n matrix is obtained by suitably modifying Lemma 5.2; the modified lemma can be obtained from [AV87]. ∎
•5.2 BOIIIIIIs for BTa Model With €X > 1
For α > 1, the time bounds for various problems in the BT_α model are listed in the table given below. The time required for simple problems such as merging, the shuffle-exchange permutation, and the touch problem is dominated by the time needed to access the farthest element (Θ(n^α)). The upper bounds for matrix transposition, computing FFT graphs, and sorting can be obtained by simple divide-and-conquer algorithms. The upper bound for matrix multiplication can be obtained by using the algorithm following Corollary 4.3. The only nontrivial bound is the lower bound for matrix multiplication (using semiring operations only), and this is given below:
Theorem 5.5: For α = 1.5, multiplying two n × n matrices using only (×, +) requires Ω(n³ log n) time on a BT_α machine.
Proof Outline: Suppose two √n × √n matrices are stored in the secondary memory of a two-level memory L(m) machine. Then it can be shown that any algorithm that uses only (×, +) takes Ω(n^{3/2}/m^{3/2}) transfer steps to multiply these matrices. Now, for 0 ≤ i ≤ log n - 1, consider a two-level memory system L(m_i) such that m_i = 2^{i+1} - 1, and let T_i(m_i) denote the minimum number of transfer steps that are required to multiply two matrices on L(m_i). Using Lemmas 5.2 and 5.3 (which can be modified to apply to BT_α, for α > 1, as well), it follows that the time taken by any algorithm that multiplies two matrices in the BT_{1.5} model and that uses (×, +) only is at least
Ω(Σ_{i=0}^{log n - 1} 2^{3i/2} × T_i(m_i)).
Now, the result follows by noting that T_i(m_i) = Ω(n^{3/2}/2^{3i/2}) for multiplying two matrices on L(m_i). ∎
5.3 Optimal Bounds for BT_log Model
The time bounds for various problems in the BT_log model are listed in the table given below. The upper bounds for simple problems such as merging, the shuffle-exchange permutation, and the touch problem are similar to those given in Theorem 2.2, except that the data are now partitioned into blocks of size log n instead of n^α. The lower bound proof for these problems is also similar to the proof of Theorem 2.1, and the modifications are explained below. The upper bound of O(n log* n) for matrix transposition is obtained by a divide-and-conquer algorithm, and the lower bound follows because matrix transposition requires all inputs to be separated from their predecessors. The upper bounds of O(n log n) for computing FFT graphs, computing arbitrary permutations, and sorting follow from Theorems 3.3 and 3.4. T steps of a straight-line RAM algorithm can be simulated in time O(T log T log* n) by adapting the algorithm given in section 4.2. A dictionary can be maintained with search time O(log³ n / log log n) and with amortized update time O(log n log* n), using a structure similar to that in section 4.1.
Another dictionary structure is obtained using a B-tree [AHU83] whose buckets have size Θ(log n). Each bucket is organized as a balanced search tree stored in contiguous memory locations. A dictionary operation requires O(log n / log log n) accesses to buckets. Each bucket can be moved to fast memory in time O(log n). The operations performed on these buckets are search, insert, delete, merge, and split. Each of these operations can be performed (for merge and split, this time is amortized over n updates) in time O(log² m) = O((log log n)²). Thus, each dictionary operation can be executed in time O(log² n / log log n). The bucket size has to be updated when the number of entries grows beyond n² or decreases beneath √n; this updating can be done in amortized time O(log² n / log log n) per entry. Therefore, the structure supports search in time O(log² n / log log n), and inserts and deletes in amortized time O(log² n / log log n). The searching time can be proved to be optimal within a multiplicative constant.
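A toy rendering of the bucket organization (our own sketch: buckets are plain sorted lists, the parameter b stands in for Θ(log n), and deletions, the in-bucket balanced trees, and the block moves to fast memory are not modeled):

```python
import bisect

class BucketDict:
    """Keys live in consecutive sorted buckets; an overfull bucket is split
    in two, keeping every bucket within a constant factor of the target size."""

    def __init__(self, b=8):
        self.b = b
        self.buckets = [[]]       # globally sorted list of sorted buckets

    def _idx(self, key):
        # index of the first bucket whose largest key is >= key
        for i, bk in enumerate(self.buckets):
            if not bk or key <= bk[-1]:
                return i
        return len(self.buckets) - 1

    def insert(self, key):
        i = self._idx(key)
        bk = self.buckets[i]
        j = bisect.bisect_left(bk, key)
        if j < len(bk) and bk[j] == key:
            return                            # key already present
        bk.insert(j, key)
        if len(bk) > 2 * self.b:              # split an overfull bucket
            h = len(bk) // 2
            self.buckets[i:i + 1] = [bk[:h], bk[h:]]

    def search(self, key):
        bk = self.buckets[self._idx(key)]
        j = bisect.bisect_left(bk, key)
        return j < len(bk) and bk[j] == key
```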
Theorem 5.6: Let 0 < c ≤ 1 be a constant. Any algorithm that touches cn inputs has time complexity Ω(n log* n) in the BT_log model.
Proof: As in Theorem 2.1, define the potential at step t as Φ(t) = Σ_{i=1}^{n} log* b_i(t); the potential drop over the whole computation satisfies Φ(0) - Φ(final) ≥ Σ_{i=1}^{cn} log* i = Ω(n log* n). Now, if an operation copies a block from locations p - l, ..., p to locations q - l, ..., q, then the potential may decrease if q < p. The time for this move is log p + l, and following Theorem 2.1, the decrease in potential is at most
(l + 1) log* p - Σ_{i=0}^{l} log* i.
Consequently, we conclude the proof by showing that
(l + 1) log* p - Σ_{i=0}^{l} log* i ≤ 2(l + log p).
But
(l + 1) log* p - Σ_{i=0}^{l} log* i = Σ_{i=0}^{log log p} (log* p - log* i) + Σ_{i=log log p + 1}^{l} (log* p - log* i) ≤ log log p × log* p + 2(l - log log p).
The first term is bounded by 2 log p, and the second term is bounded by 2l. ∎
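The quantities in this proof are easy to experiment with (a sketch of our own; log* is the iterated base-2 logarithm, with log* x = 0 for x ≤ 1):

```python
import math

def log_star(x):
    """Iterated logarithm: how many times log2 must be applied until the
    value drops to 1 or below."""
    c = 0
    while x > 1:
        x = math.log2(x)
        c += 1
    return c

def decrease_bound(p, l):
    """Upper bound, from the proof above, on the potential drop of a block
    copy [p-l, p] -> [q-l, q]."""
    return (l + 1) * log_star(p) - sum(log_star(i) for i in range(l + 1))
```

The test below checks the claimed inequality decrease_bound(p, l) ≤ 2(l + log p) on a range of values.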
Note that there are many permutations, even permutations involving all inputs, that can be computed in linear time. For example, it is possible to exchange the first n/2 inputs with the last n/2 inputs in linear time; this permutation moves all elements and has Θ(n²) inversions. On the other hand, there are permutations that require Ω(n log n) time to compute. We prove their existence using a nonconstructive counting argument; we have no explicit example of such a "hard" permutation.
Theorem 5.7: In a BT_log machine, the number of distinct sequences of block copy operations with a total time t is at most 9^t.
Proof: Let C = C_1, C_2, ..., C_k be a sequence of block copy operations C_i = ([x_i - l_i, x_i] → [y_i - l_i, y_i]), with 0 ≤ l_i < x_i, y_i and [x_i - l_i, x_i] ∩ [y_i - l_i, y_i] = ∅; its running time is t(C) = Σ_{i=1}^{k} t(C_i), where t(C_i) = ⌈log(max(x_i, y_i))⌉ + l_i. We will give a one-to-one map from such sequences into sequences r = r_1, r_2, ..., r_{3k} of integers r_i ≥ 2, with a cost function cost(r) = Σ_{i=1}^{3k} cost(r_i), cost(r_i) = ⌈log r_i⌉, satisfying 2t(C) ≥ cost(r). The result then follows from the claim that |{r | cost(r) = t}| = 3^{t-1}.
                      RAM         BT_log         BT_{x^α}         BT_x            BT_{x^α}
                                                 (0 < α < 1)      (α = 1)         (α > 1)

Simple Problems       Θ(n)        Θ(n log* n)    Θ(n log log n)   Θ(n log n)      Θ(n^α)
(e.g., Touch)

Matrix                Θ(n³)       Θ(n³)          Θ(n³)            Θ(n³)           Θ(n³) if α < 1.5
Multiplication                                                                    Θ(n³ log n) if α = 1.5
(+, ×)                                                                            Θ(n^{2α}) if α > 1.5

Arb. Perm.            Θ(n)        Θ(n log n)     Θ(n log n)       Θ(n log² n)     Θ(n^α)

Mat. Transpose,       Θ(n)        Θ(n log* n)    Θ(n log log n)   Θ(n log² n)     Θ(n^α)
Rational Perms.

FFT Graph             Θ(n log n)  Θ(n log n)     Θ(n log n)       Θ(n log² n)     Θ(n^α)

Sorting               Θ(n log n)  Θ(n log n)     Θ(n log n)       Θ(n log² n)     Θ(n^α)

Searching             Θ(log n)    Θ(log² n /     Θ(n^α)           Θ(n)            Θ(n^α)
                                  log log n)

Insert & Delete       O(log n)    O(log n        O(log n          O(log² n /      Θ(n^{α-1})
(Amortized)                       log* n)        log log n)       log log n)

                                             TABLE
The one-to-one map from C to r is induced by a one-to-one function h that maps C_i into (r_{3i-2}, r_{3i-1}, r_{3i}). We abbreviate h([x - l, x] → [y - l, y]) by h(l, x, y), and define h as follows:
(a) h(l, x, y) = (l, x, y) if l ≥ 2.
(b) h(0, x, y) = (x, y, y) if x, y ≥ 2.
(c) h(1, x, y) = (x, y, x).
(d) h(0, 1, y) = (y, y + 1, y).
(e) h(0, x, 1) = (x + 1, x, x + 1).
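The case analysis (a)-(e) can be transcribed directly and its injectivity checked by enumeration (our own harness; the disjointness test |x - y| ≥ l + 1 restates the condition that the source and target blocks do not overlap):

```python
def h(l, x, y):
    """The encoding map (a)-(e) above, transcribed directly."""
    if l >= 2:
        return (l, x, y)            # (a)
    if l == 1:
        return (x, y, x)            # (c)
    if x == 1:
        return (y, y + 1, y)        # (d)
    if y == 1:
        return (x + 1, x, x + 1)    # (e)
    return (x, y, y)                # (b): l == 0 and x, y >= 2

def legal_triples(N):
    """All legal single copies [x-l, x] -> [y-l, y] with endpoints <= N."""
    for x in range(1, N + 1):
        for y in range(1, N + 1):
            for l in range(0, min(x, y)):
                if abs(x - y) >= l + 1:      # blocks disjoint
                    yield (l, x, y)
```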
It can be seen that h is one-to-one, since l < x, y; x ≠ y; and if l = 1, then x is neither y - 1 nor y nor y + 1. It can also be checked that, for all i, 2t(C_i) ≥ cost(h(C_i)).
Now, to establish the claim that there are 3^{t-1} sequences r with cost(r) = t, let A_t = |{r | cost(r) = t}|, and observe that a sequence with cost t is obtained either from a sequence with cost t - 1 by appending a 2, or from a cost t - 2 sequence by appending a 3 or a 4, and so on. This yields the recurrence
A_t = A_{t-1} + 2A_{t-2} + ... + 2^{t-1}A_0, where A_0 = 1,
which solves to A_t = 3^{t-1} for t ≥ 1. ∎
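The counting recurrence A_t = A_{t-1} + 2A_{t-2} + ... + 2^{t-1}A_0 (A_0 = 1) and its closed form 3^{t-1} can be checked numerically (our own sketch; it uses the fact that exactly 2^{s-1} integers r ≥ 2 have ⌈log₂ r⌉ = s):

```python
import math

def A(t):
    """Number of sequences of integers r_i >= 2 with total cost t, where
    cost(r) = ceil(log2 r): A_t = sum_{s=1}^{t} 2**(s-1) * A_{t-s}, A_0 = 1."""
    vals = [1]
    for u in range(1, t + 1):
        vals.append(sum(2 ** (s - 1) * vals[u - s] for s in range(1, u + 1)))
    return vals[t]
```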
Theorem 5.8: In a BT_log machine, there is a constant c such that the number of distinct sequences of operations (block copy or otherwise) with a total time ≤ t is at most 2^{ct}.
Proof: The proof follows as a direct extension of the above proof. The constant c depends on the number of different kinds of operations allowed in the BT_log machine. ∎
One can also consider several extensions of the BT model. For example, one could study other access functions, such as step functions, that may better reflect the physical situation of the memory hierarchy in real machines. Another possibility is that the transfer time for blocks could be changed from one per word to some function g(x) per word for copying from (or to) location x. Yet another aspect that is significant for real memory hierarchies is parallelism in block transfer (several reads from different disks may take place simultaneously; transfers at different levels may proceed concurrently and can be overlapped with processing). Traditionally, this represented one of the early uses of parallelism in computers, and from a theoretical point of view, it would seem necessary if we are to have data structures with good worst-case performance. And, finally, in view of the importance of the memory organization in multiprocessor machines, some appropriate model for this would be nice. In any case, it is desirable that model extensions remain clean and robust.
This paper can be thought of as a step in developing a theory of computation that is aimed at data movement as against data modification. It is too early to tell if such a memory/communication-oriented (versus CPU-oriented) theory of computation will have any influence on pragmatic algorithms, machine architecture, memory management, or language design.
Some generalities are beginning to emerge. It appears that some problems (FFT, sorting, matrix multiplication) are very well behaved, in that their running time (on BT or HMM using different access functions) usually equals the maximum of the RAM time and the time to read (i.e., touch) the inputs in the hierarchical memory (there seems to be a slight penalty when RAM time and touch time are roughly equal). This is about the best that good locality of reference could provide. Other problems (e.g., DAG simulation) appear not to behave in this manner. We do not understand what characterizes such behaviour.
Can the directory algorithms for BT_α be improved? And, it seems important to study good memory management algorithms.
6. Conclusions
In this paper we have introduced a model for hierarchical
memory with block transfer. It is relatively clean and robust, but
nevertheless appears to mimic the behavior of real machines. Good
algorithms in this model utilize both temporal and spatial locality
of reference. The theory for this model appears to be rich and
deep.
There are a large number of possibilities for future research;
some are specific technical problems, others are more general issues.
The Corollary applies even to nonuniform algorithms that perform block transfers or a finite set of other arbitrarily powerful operations, and in particular to conservative algorithms. It requires, however, that the permutation to be achieved is not given as an additional input. It may be noted that the lower bound results of Theorem 5.8 and Corollary 5.9 also apply to any BT_f machine where f(x) = Ω(log x) (provided f(x) = 0 for at most one value of x) and, in particular, they apply to BT_α for any α > 0.
Corollary 5.9: The expected time for achieving a random permutation on n elements on a BT_log machine is Ω(n log n).
Proof: A simple counting argument yields a lower bound of (1/c) × n log n × (1 - o(1)), where c is the constant in Theorem 5.8. ∎
In the BT_α model, there is a sorting algorithm that takes O(n log n log log n) time and O(n) space, and another algorithm that takes O(n log n) time and O(n log log n) space. Is it possible to obtain an O(n log n) algorithm that takes only O(n) space? And are there problems for which there are space-time tradeoffs?
Permuting data is at the heart of many BT algorithms, and it would be nice to understand better the complexity of permutations. For example, even in the BT_log model, we showed that most permutations require at least Ω(n log n) time. Can one demonstrate such a permutation, even for conservative algorithms?
There are numerous other areas, such as data structures and graph problems, that need to be analyzed. As an example, is it possible to maintain directories in BT_log with O(log² n / log log n) search time and O(log n log* n) amortized update time?
Acknowledgements: The authors wish to thank Michael Fischer for several useful suggestions.
References
[AC86] R. C. Agarwal and J. W. Cooley, "Fourier Transform and Convolution Subroutines for the IBM 3090 Vector Facility," IBM J. of Research and Development, March 1986, 145-162.
[AV87] A. Aggarwal and J. Vitter, "The I/O Complexity of Sorting and Related Problems," Tech. Report, Dept. of Computer Science, Brown University, August 1987. Some results of this report can be found in Proc. of the 14th Int. Coll. on Automata, Languages and Programming, Karlsruhe, West Germany, July 1987, 467-478.
[AACS87] A. Aggarwal, B. Alpern, A. K. Chandra and M. Snir, "A Model for Hierarchical Memory," Proc. of the 19th Annual ACM Symposium on the Theory of Computing, New York, 1987, 305-314.
[AHU74] A. V. Aho, J. E. Hopcroft and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading Mass., 1974.
[AHU83] A. V. Aho, J. E. Hopcroft and J. D. Ullman, Data Structures and Algorithms, Addison-Wesley, Reading Mass., 1983.
[AIS83] B. Awerbuch, A. Israeli and Y. Shiloach, "Efficient Simulation of PRAM by Ultracomputer," Technical Report 120, IBM Scientific Center, Haifa, Israel, May 1983.
[Ba80] J. L. Baer, Computer Systems Architecture, Computer Science Press, Potomac MD, 1980.
[BS80] J. L. Bentley and J. B. Saxe, "Decomposable Searching Problems. I. Static-to-Dynamic Transformations," J. of Algorithms, Dec. 1980, 301-358.
[De70] P. J. Denning, "Virtual Memory," ACM Computing Surveys, Sept. 1970, 153-189.
[F72] R. W. Floyd, "Permuting Information in Idealized Two-Level Storage," in R. E. Miller and J. W. Thatcher (editors), Complexity of Computer Computations, 105-109, Plenum Press, New York, 1972.
[FP79] P. C. Fischer and R. L. Probert, "Storage Reorganization Techniques for Matrix Computation in a Paging Environment," Comm. ACM, Vol. 22, No. 7, July 1979, 405-415.
[G74] J. Gecsei, "Determining Hit Ratios in Multilevel Hierarchies," IBM J. of Research and Development, July 1974, 316-327.
[HK81] J. W. Hong and H. T. Kung, "I/O Complexity: The Red-Blue Pebble Game," Proc. of the 13th Ann. ACM Symposium on Theory of Computing, Oct. 1981, 326-333.
[K73] D. E. Knuth, The Art of Computer Programming, Vol. 3,
Sorting and Searching, § 5.5.9, Addison-Wesley, Reading Mass.,
1973.
[L82] F. T. Leighton, "A Layout Strategy for VLSI Which Is Provably Good," Proc. of the 14th Ann. ACM Symposium on Theory of Computing, Oct. 1982, 85-98.
[MGS70] R. L. Mattson, J. Gecsei, D. R. Slutz and I. L. Traiger, "Evaluation Techniques for Storage Hierarchies," IBM Systems Journal, 1970, 78-117.
[MC69] A. C. McKellar and E. G. Coffman, Jr., "Organizing Matrices and Matrix Operations for Paged Memory Systems," Comm. ACM, Vol. 12, No. 3, March 1969, 153-165.
[MC80] C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, Reading Mass., 1980, pg. 316.
[S84] A. Schonhage, "A Nonlinear Lower Bound for Random Access Machines Under Logarithmic Cost," Technical Report RJ 4527, IBM Almaden Research Laboratory, May 1984.
[Si83] G. M. Silberman, "Delayed-Staging Hierarchy Optimization," IEEE Trans. on Computers, TC-32, Nov. 1983, 1029-1037.
[SV82] Y. Shiloach and U. Vishkin, "An O(log n) Parallel Connectivity Algorithm," J. of Algorithms, 1982, 57-67.
[Sm86] A. J. Smith, "Bibliography and Readings on CPU Cache Memories and Related Topics," Comp. Arch. News, Jan. 1986, 22-42.
[T83] R. E. Tarjan, Data Structures and Network Algorithms, SIAM,Philadelphia Penn., 1983.
[W83] C. K. Wong, Algorithmic Studies in Mass Storage Systems,Computer Science Press, Rockville MD., 1983.