
  • 8/14/2019 Routing Merging Sorting


JOURNAL OF COMPUTER AND SYSTEM SCIENCES 30, 130-145 (1985)

Routing, Merging, and Sorting on Parallel Models of Computation*

A. BORODIN, Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 1A7

J. E. HOPCROFT, Department of Computer Science, Cornell University, Ithaca, N.Y. 14853

Received June 13, 1983; revised July 15, 1984

A variety of models have been proposed for the study of synchronous parallel computation. These models are reviewed and some prototype problems are studied further. Two classes of models are recognized: fixed connection networks and models based on a shared memory. Routing and sorting are prototype problems for the networks; in particular, they provide the basis for simulating the more powerful shared memory models. It is shown that a simple but important class of deterministic strategies (oblivious routing) is necessarily inefficient with respect to worst case analysis. Routing can be viewed as a special case of sorting, and the existence of an O(log n) sorting algorithm for some n processor fixed connection network has only recently been established by Ajtai, Komlos, and Szemeredi (15th ACM Sympos. on Theory of Comput., Boston, Mass., 1983, pp. 1-9). If the more powerful class of shared memory models is considered, then it is possible to simply achieve an O(log n log log n) sort via Valiant's parallel merging algorithm, which it is shown can be implemented on certain models. Within a spectrum of shared memory models, it is shown that log log n is asymptotically optimal for n processors to merge two sorted lists containing n elements. © 1985 Academic Press, Inc.

I. INTRODUCTION: WHAT IS A REASONABLE MODEL?

A number of relatively diverse problems are often referred to under the topic of

parallel computation. The viewpoint of this paper is that of a tightly coupled, synchronized (by a global clock) collection of parallel processors, working together to solve a terminating computational problem. Such parallel processors already exist and are used to solve time-consuming problems in a wide variety of areas

* This research was supported in part by ONR contract N00014-76-C-0018 and NSF grant MCS-81-01220.

Copyright © 1985 by Academic Press, Inc. All rights of reproduction in any form reserved.


including computational physics, weather forecasting, etc. The current state of hardware capabilities will facilitate the use of such parallel processors in many more applications as the speed and the number of processors that can be tightly coupled increase dramatically. (A very good introduction to the future promise of highly parallel computing can be found in the January, 1982 issue of Computer, published by the IEEE Computer Society.)

Within this viewpoint, Preparata and Vuillemin [20] distinguish two broad categories. Namely, we can differentiate between those models that are based on a fixed connection network of processors and those that are based on the existence of global or shared memory. In the former case, we assume that only graph theoretically adjacent processors can communicate in a given step, and we usually assume that the network is reasonably sparse; as examples, consider the shuffle-exchange network (Stone [26]) and its development into the ultracomputer of Schwartz [22], the array or mesh connected processors such as the Illiac IV, the cube-connected cycles of Preparata and Vuillemin [20], or the more basic d-dimensional hypercube studied in Valiant and Brebner [29]. As examples of models based on shared memories, there are the PRAC of Lev, Pippenger, and Valiant [18], the PRAM of Fortune and Wyllie [8], the unnamed parallel model of Shiloach and Vishkin [23], and the SIMDAG of Goldschlager [11]. Essentially these models differ in whether or not they allow fetch and write conflicts, and if allowed, how write conflicts are resolved.

From a hardware point of view, fixed connection models seem more reasonable and, indeed, the global memory-processor interconnection would probably be realized in practice by a fixed connection network (see Schwartz [22]). Furthermore, for a number of important problems (e.g., FFT, bitonic merge, etc.) either the shuffle-exchange or the cube-connected cycles provide optimal hosts for well-known algorithms. On the other hand, many problems require only infrequent and irregular processor communication, and in any case the shared memory framework seems to provide a more convenient environment for constructing algorithms. Finally, in defense of the PRAM, it is plausible to assume that some broadcast facilities could be made available.

The problem of sorting, and the related problem of routing, are prototype problems, due both to their intrinsic significance and their role in processor communication. Since merging is a (the) key subroutine in many sorting strategies, we are interested in merging and sorting with respect to both the fixed connection and shared memory models. For many fixed connection networks ([20, 26, 29]) the complexity of merging has been resolved by the fundamental log n algorithms of Batcher (see Knuth [15] for a discussion of odd-even and bitonic merge). The lower bound in this regard is immediate because log n is the graph theoretic diameter. In this paper, we concentrate on routing in networks and the complexity of merging (with application to sorting) on shared memory machines.


II. ROUTING IN NETWORKS

The problem of routing packets in a network arises in a number of situations. Various routing protocols have been considered. For packet switching networks we are concerned with arbitrary interconnections. For these general networks, protocols such as forward state controllers have been studied. For applications in parallel computer architecture we are concerned with special networks such as a d-dimensional hypercube or a shuffle-exchange interconnection. In this setting O(log n) global or centralized strategies and O(log² n) local or distributed strategies are known.

For our purposes, a network is a digraph (V, E), where the set of nodes V are thought of as processors, and (i, j) ∈ E denotes the ability of processor i to send one packet or message to processor j in a given time step. A packet is simply an (origin, destination) pair or, more generally, an (origin, destination, bookkeeping information) triple.

A set of packets are initially placed on their origins, and they must be routed in parallel to their destinations; bookkeeping information can be provided by any processor along the route traversed by the packet. The prototype question in this setting is that of full permutation routing, in which we must design a strategy for delivering n packets in an n node network, when π: {origins} → {destinations} is a permutation of the node labels {1, 2,..., n}. Other routing problems, in particular partial routing, where π is 1-1 but there may be fewer than n packets, are also of immediate interest. Unless stated otherwise, a routing strategy will denote a solution to the full permutation routing problem.

There are three positive routing results which provide the motivation for this section.

(1) Batcher's (see Knuth [15]) O(log² n) sorting network algorithm (say, based on the bitonic merge) can be implemented as a routing strategy on various sparse networks, such as the shuffle-exchange (Stone [26]), the d-dimensional cube (Valiant and Brebner [29]), or the cube-connected cycles (CCC; Preparata and Vuillemin [20]). This constitutes a local or distributed strategy, in that each processor p decides locally on its next action using only the packets at p. Indeed, the algorithm can be implemented so that there is exactly one packet per processor throughout execution. We also note that Batcher's bound is a worst case bound, holding for any initial permutation of the packets; in fact, every initial permutation uses the same number of steps.

(2) Valiant and Brebner [29] construct an O(log n) Monte Carlo local routing strategy for the d-cube, where d = log n. In a Monte Carlo strategy, processors can make random choices in deciding where to send a given packet. Valiant's strategy achieves O(log n) in the sense that for every permutation, with high probability (e.g., ≥ 1 − n^{−c}) all packets will reach their destination in O(log n) steps.


The naive deterministic strategy also delivers every packet quickly, but only with high probability (here, the probability is on the space of possible inputs). The naive strategy is to order the dimensions of the cube, and then proceed to move packets along those dimensions (in order) in which the origin and destination differ. These initial results have been extended to the shuffle-exchange model (Aleliunas [1]) and a variant CCC+ of the CCC model (Upfal [27]), these latter networks having only constant degree as contrasted with the log n degree of the cube. Following this development, Reif and Valiant [21] are now able to achieve an O(log n) Monte Carlo sorting algorithm for the CCC+.
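The two-phase idea behind Valiant and Brebner's strategy can be simulated directly; in this minimal sketch (function names ours) each packet first walks, dimension by dimension, to a uniformly random intermediate node, and then on to its true destination:

```python
import random

def bitfix_path(src, dst, d):
    """Dimension-order (bit-fixing) path from src to dst on the d-cube."""
    path, cur = [src], src
    for i in range(d):
        if (cur ^ dst) >> i & 1:      # correct dimension i if it differs
            cur ^= 1 << i
            path.append(cur)
    return path

def two_phase_paths(perm, d, rng):
    """Valiant-Brebner routing: random intermediate node, then true destination."""
    n = 1 << d
    paths = []
    for src in range(n):
        mid = rng.randrange(n)                 # phase 1 target, chosen at random
        first = bitfix_path(src, mid, d)
        second = bitfix_path(mid, perm[src], d)
        paths.append(first + second[1:])
    return paths
```

The point of the randomization is that no fixed permutation can be a worst case: whatever the input, the phase-1 traffic looks like a random routing problem, which is how the O(log n) high-probability bound is obtained.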

(3) The Slepian-Benes-Clos permutation network (see Lev, Pippenger, and Valiant [18]) can be implemented as a global worst case O(log n) routing scheme (on any of the above-mentioned sparse networks). Here, a pair of processors at a given time step simulate a switch of the permutation network; as such, the actions of any processor depend on the entire permutation.

We note that each of these strategies can be modified to handle partial routing. (This is immediate except for the strategy derived from Batcher's algorithm.) In order to make the distinction between local and global more meaningful we must limit the amount of bookkeeping information (say to O(log n) bits) since otherwise we can funnel all the input permutation information to a distinguished processor which then broadcasts this information throughout the network. Recently, Ajtai, Komlos, and Szemeredi [2] have constructed an ingenious O(log n) sorting algorithm for the comparator network model (see Knuth [15]). Since the gates in each stage of the comparator network have constant degree, these gates can be collapsed to form an O(log n) degree network in our sense. At the moment, it is not clear whether or not this algorithm could be implemented (within the same O(log n) bound) on any of the specific fixed connection networks mentioned in this paper. Nor is it clear whether the implied constant could be made small enough to make the algorithm more competitive. Yet, this result stands as a theoretical breakthrough, further challenging us to search for a simple and very efficient local routing (or sorting) method for specific networks of interest.

The Relation Between Fixed Connection and Global Memory Models

Before proceeding to discuss routing algorithms for fixed connection networks, we want to briefly relate such parallel machines to parallel models based on a global memory. Indeed, this relation is yet another motivation for the importance of the routing and sorting problems. The importance of the routing problem is also emphasized in the paper of Galil and Paul [10], who consider simulations between various parallel models. We mention only a few global memory models:

(1) PRAC (Lev, Pippenger, and Valiant): simultaneous read or write (of the same cell) is not allowed (also called EREW RAM).

(2) PRAM (Fortune and Wyllie): simultaneous fetches are allowed but no simultaneous writes (also called CREW RAM).

(3) WRAM: WRAM denotes a variety of models that allow simultaneous reads and (certain) writes, but differ in how such write conflicts are to be resolved (also called CRCW RAM).

(a) (Shiloach and Vishkin) A simultaneous write is allowed only if all processors are trying to write the same thing; otherwise the computation is not legal.

(b) An arbitrary processor is allowed to write.

(c) (Goldschlager) The lowest numbered processor is allowed to write.

For the purpose of comparison with an n-node fixed connection network, we assume the global memory models have n processors. It is obvious that even the weakest of the above models, the PRAC, can efficiently simulate a fixed connection network by dedicating a memory location to each directed edge of the network. Conversely, Lev, Pippenger, and Valiant [18] observe that a fixed connection network capable of (partial) routing in r(n) time can simulate one step of a PRAC in time O(r(n)), where now n bounds both the number of processors and the number of global memory cells. Furthermore, if we assume that the network is capable of sorting in r(n) time, then one can also simulate one step of a PRAM or WRAM in time O(r(n)). The idea for this simulation was sketched in a preliminary version of this paper (Borodin and Hopcroft [3]). A more comprehensive approach is given by Vishkin [30].

As can then be expected, sorting also provides a way to simulate a PRAM or WRAM by a PRAC without changing the processor or memory requirements. That is, if the original machine operates in time t using n processors and m global memory cells, then the simulating PRAC operates in time O(t · r(n)) using the same number of processors and memory cells. A more straightforward O(t · log n) time simulation can be achieved using a tournament, but here we must increase the global memory by an O(n) factor, although the number of processors does not increase.

By these observations, and by using the Ajtai et al. [2] or Batcher sort, one sees that all these models have time complexities within an O(log n) (resp. O(log² n)) multiplicative factor. These observations then closely follow the spirit of Cook [5], who distinguishes between fixed connection models (e.g., circuits, alternating Turing machines) and variable connection models (e.g., SIMDAGs) in the context of developing a complexity theory for the parallel computation of Boolean functions. For all the models he considers, one has variable connection machine (time t) ≤ fixed connection machine (time t²). We note that Cook [5] only considers models where the number of processors n is at most exponential in t. Without certain restrictions (e.g., see the precise definition of a SIMDAG in Goldschlager [11]), n can be arbitrarily large relative to t.

A Lower Bound for a Special Case

It turns out to be surprisingly difficult to analyze reasonably simple strategies. We are able to show, however, that a very simple class of strategies, including the naive


strategy, cannot work well in the worst case, this being the case for a wide class of networks. Specifically, we study oblivious strategies, where the route of any packet is completely determined by the (origin, destination) of the packet. Oblivious strategies are almost, but not quite, local by definition; we might still determine when a processor sends a packet along an edge by global means. By its very nature, an oblivious strategy tends to be very easy to implement. We can further motivate the use of such strategies by calling attention to the processor-memory interconnection network of the NYU-Ultracomputer (see Gottlieb, Lubachevsky, and Rudolph [12]) which requires an oblivious strategy for the memory arbitration scheme.

THEOREM 1. In any n-node network having in-degree d, the time required in the worst case by any oblivious routing strategy is Ω(√n/d^{3/2}).

Proof. For motivation, first consider a more restricted class of protocols,

namely those where the next step in the route of a packet depends only on its present location and its final destination. For this class the set of routes for packets heading for a given destination forms a tree. To see this, observe that at any vertex there is a unique edge that the packet will take. Following the sequence of edges from any vertex must lead to the final destination.

Now, for the given definition of oblivious routing, the routes of packets heading for a given destination no longer form a tree. Instead, we have a directed, not necessarily acyclic, graph G with a distinguished vertex v₀ (i.e., the destination). Furthermore G has in-degree d, and there are n distinct (in terms of their origin) but not necessarily disjoint routes (= distinguished paths) in G terminating at v₀. Let us call such a graph a (d, n) destination graph. If u is the origin of a route that enters v, we will call u an origin for v.

The goal is to find a vertex through which many, say t, packets must pass. Since at most d packets can enter this vertex at a given time, a delay of at least t/d would be necessitated.

If we superimpose the n destination graphs determined by the n possible destinations for packets, each vertex is on n graphs. In order to force a packet headed for a given destination to go through a specific vertex v, we must initially start the packet on a vertex that is an origin of v in the particular destination graph. But there may be only a small number of such origins of v, and if many destination graphs have the same set there will not be enough descendents to start each packet at a distinct vertex. Hence we would like to have a vertex v with the property that v has many origins in many destination graphs. To this end, we need

LEMMA 1. Let G = (V, E) be a (d, n) destination graph and let T(d, k, n) be the minimum number of vertices having k (k ≥ 1) or more origins in G. Then T(d, k, n) ≥ n/(d(k − 1) + 1).

Proof. (The proof we use here is due to Paul Beame.) Let S = {v | v ∈ V and v has at least k origins}. Let s = the cardinality of S. Now a route must either originate in S or in V − S. Hence n is less than or equal to s plus the number of routes originating in V − S. We wish to count the routes originating in V − S as they enter S


for the first time. Since the in-degree of G is d, for any node v ∈ S there are at most d nodes u such that (u, v) ∈ E. Now any u ∈ V − S can contribute at most k − 1 routes to the count. Hence the number of routes originating in V − S is at most s · d · (k − 1), since s · d bounds the number of entry points. Hence n ≤ s(d(k − 1) + 1), i.e., s ≥ n/(d(k − 1) + 1).
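As a small sanity check on the count (ours, not the paper's), consider the simplest destination graph: a directed path n−1 → n−2 → ... → 0 with destination v₀ = 0, so d = 1 and vertex v lies on exactly the routes originating at v, v+1,..., n−1. The bound T(d, k, n) ≥ n/(d(k − 1) + 1) can then be verified mechanically:

```python
def origins_on_path(n):
    """Origins per vertex for v0 = 0 on the path n-1 -> ... -> 1 -> 0 (d = 1).
    Vertex v is on the routes originating at v, v+1, ..., n-1."""
    return {v: n - v for v in range(n)}

def vertices_with_k_origins(n, k):
    """Number of vertices having k or more origins in this destination graph."""
    return sum(1 for o in origins_on_path(n).values() if o >= k)
```

Here the count is exactly n − k + 1, comfortably above the lemma's n/k guarantee for d = 1.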

We are now ready to complete the proof of Theorem 1. In each destination graph mark those vertices that have at least k origins. Let k = √(n/d). In the network assign a count to each vertex indicating the number of destination graphs in which the vertex is marked. Since at least n/(dk) vertices are marked in each graph, the sum of the counts must be at least n²/(dk). Therefore, the average count (over all vertices) is at least n/(dk) = √(n/d), which implies there is at least one vertex u which has at least √(n/d) origins in each of √(n/d) destination graphs. For each of these destination graphs we can place the corresponding packet at some vertex of the network so that it will pass through vertex u. Thus √(n/d) packets will go through u. Since u is of degree at most d, routing requires time at least √(n/d)/d = √n/d^{3/2}.

The question of when there exists a corresponding upper bound depends on the particular structure of the network. One important parameter in addition to the degree is the diameter of the graph. Clearly if the diameter of the graph is n we cannot hope for an O(√n) algorithm. However, even for some O(log n) diameter graphs with degree 2 we cannot achieve an O(√n) algorithm since there may be an isthmus or other type of bottleneck. However, for many structures there are oblivious routing algorithms that achieve or approach this lower bound. Lang [17] has given an O(√n) oblivious routing algorithm for the shuffle-exchange network, which is then asymptotically optimal since this network has constant degree. For the d = log n dimensional hypercube, the obvious oblivious algorithm will route packets for any permutation in time O(√n). (Since the hypercube has degree log n this is a factor of (log n)^{3/2} from optimal.) Namely, order the dimensions x₁, x₂,..., x_{log n}. At time 1 transmit any packet whose destination is in the opposite subcube determined by the x₁ dimension. This may result in as many as two packets being on a vertex. Next transmit packets across the x₂ dimension, etc. After ½ log n dimensions there may be as many as √n packets at a vertex, causing a delay of √n. In each subsequent dimension the maximum number of packets that can be on a vertex halves, with eventually each packet arriving at its destination. It may be possible to improve this bound to √n/log n if one partitions the √n packets that could go through a vertex v into log n groups and routes them by different edges.

Another restriction on routing is minimality. A minimal routing scheme forbids transmitting along an edge if it increases the distance of the packet from its destination. Thus every packet must follow a shortest path. For minimal routing schemes it is an interesting open problem whether there is a local (or even a global) routing scheme that is O(logʳ n) for any r. For networks such as the d-cube we know of no minimal scheme better than the oblivious strategy just given.
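The √n bottleneck for the dimension-order (bit-fixing) strategy is easy to exhibit concretely. Under the "transpose" permutation that swaps the high and low halves of each address, every packet whose high half is zero is funneled through node 0; the small simulation below (ours) counts them:

```python
def bitfix_path(src, dst, d):
    """Dimension-order route: correct differing bits from dimension 0 upward."""
    path, cur = [src], src
    for i in range(d):
        if (cur ^ dst) >> i & 1:
            cur ^= 1 << i
            path.append(cur)
    return path

def transpose_congestion(d):
    """Count packets of the transpose permutation whose route visits node 0."""
    h = d // 2
    n = 1 << d
    low = (1 << h) - 1
    count = 0
    for s in range(n):
        dst = ((s & low) << h) | (s >> h)   # swap high and low address halves
        if 0 in bitfix_path(s, dst, d):
            count += 1
    return count
```

For d = 10 (n = 1024) at least 2^{d/2} = 32 = √n packets cross node 0: each source of the form (0, b) must first clear its low bits (passing through node 0) before setting its high bits. A node of congestion √n forces Ω(√n/d) time for this oblivious scheme.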


A primary question is whether there is a local routing algorithm for, say, the

d-cube that is better than O(log² n). It is remarkably difficult to analyze some very simple strategies. In particular, does the following algorithm, or some variant of it, route an arbitrary permutation in O(log n)? Consider some vertex. At a given stage as many as d packets, where d is the dimension of the cube, will arrive. As many as possible will be sent closer to their destinations. The remaining packets will be shipped to vertices of distance one greater. Since packets are not allowed to build up at a vertex, the intuitive effect is to enlarge bottlenecks to several vertices and hence to allow more packets to get to their destinations in a given time. Although experimentally the algorithm appears promising, we have not been able to formally analyze its behavior.

III. MERGING AND SORTING ON SHARED MEMORY MODELS

A Hierarchy of Models

The shared memory models usually studied all possess a global memory, each cell of which can be read or written by any processor. For the purpose of constructing algorithms, one usually assumes a single instruction stream; that is, one program is executed by all processors. However, when the processor number itself is used to control the sequencing of steps, and some ability to synchronize control is introduced, then the effect is that of a multiple instruction stream. The processors are assumed to have some local memory, and each processor can execute basic primitive operations such as comparisons. In the parallel computation tree model, up to k comparisons are made at each step and the branches are labeled by each of the 3^k possible outcomes. It should be clear that for the problems of concern, parallel computation trees can simulate any reasonable parallel model, and in particular, can simulate all of the aforementioned shared memory models. Let M denote any of these models. We will be concerned with T_merge^M(n, m, p) and T_sort^M(n, p), the minimum number of parallel steps to merge two sorted lists of n and m elements (resp. to sort n arbitrary elements) using p processors. Typically, n = m and p is O(n) or O(n log n). Clearly, for any problem we have

T^PRAC ≥ T^PRAM ≥ T^WRAM.

Our main contribution is to establish the following two theorems:


THEOREM 2. Let M denote the parallel computation tree model. Then T_merge^M(n, n, cn) is Ω(log log cn − log log c), and hence T_merge^M(n, n, n log^a n) is Ω(log log n) for all a. (Haggkvist and Hell [14] have independently proved similar lower bounds for merging on parallel computation trees.)

THEOREM 3. T_merge^PRAM(n, n, n) is O(log log n).

We use Valiant's algorithm, which already establishes the upper bound for the parallel comparison tree, but following Valiant [28], Preparata [19], and Shiloach and Vishkin [23], remark that a processor allocation problem must be solved to realize Valiant's algorithm on the PRAM model. Hence, the problem of merging is now resolved on all of the above shared memory models except the PRAC, for which we cannot improve on the log n upper bound of the Batcher merge. For the PRAC, Snir [24] has recently established an Ω(log n) lower bound for searching or insertion (and hence merging), allowing any number of processors. Hence, the asymptotic complexity of parallel merging is now well understood for all the above-mentioned models.

With regard to sorting, we have the following direct corollaries:

COROLLARY 1. T_sort^PRAM(n, n) is O(log n log log n).

COROLLARY 2. T_sort^PRAM(n, n log n) is O(log n).

Clearly, Corollary 1 follows from a standard merge sort, whereas Corollary 2 is a restatement of Preparata's [19] result, which can now be stated for PRAMs using Theorem 3. Corollaries 1 and 2 should be compared with the Shiloach and Vishkin upper bound of O(log² n/log(p/n) + log n) for sorting on their version of a WRAM with p processors. Whether or not an O(log n) sort can be achieved on any of the RAM models using only O(n) processors would be resolved by the Ajtai, Komlos, and Szemeredi [2] result (assuming their algorithm can be implemented without loss of efficiency on these models). Without their result, Corollary 2 and Preparata's [19] O(k log n) sort using n^{1+1/k} processors on a PRAC represent the best known O(log n) sorting algorithms for the RAM models.

With regard to lower bounds for sorting, Haggkvist and Hell [13] prove that in terms of the parallel computation tree, time less than or equal to k implies Ω(n^{1+1/k}) processors are required (and this is essentially sufficient). It follows that for the tree model and any of the RAM models, Ω(log n/log log n) is a lower bound for sorting with O(n log n) processors. For O(n) processors, Ω(log n) is a trivial lower bound resulting from the sequential lower bound of Ω(n log n).

Cook and Dwork [6] show that Ω(log n) steps on a PRAM are required for the Boolean function x₁ ∨ x₂ ∨ ... ∨ xₙ, no matter how many processors are available. It follows that Ω(log n) steps on a PRAM are required for the MAX function and hence for sorting. However, it is possible to achieve o(log n) sorting on a WRAM. Indeed, it is possible to achieve O(1) time by using an exponential


    PARALLEL MODELSOF COMPUTATION 139number of processors; e.g., use n. n processors to check each of the possible per-mutations. A major open problem is to determine the minim al num ber ofprocessors needed for an O(1) time sort on a W RA M . Stockmeyer and Vishkin[25] have shown how to simulate a t time, p processor W RA M (in particular, theSIMD AG) by an unbounded fan-in AN D/OR circuit having depth O(t) and sizepolynomial (p). By this simulation and some appropriate reducibilities, Stockmeyerand Vishk in [25] are able to use the beautiful lower bound of Furst, Saxe, and Sip-ser [9] to show that a W RA M cannot sort in O( 1) time using only a polynomial(in n) number of processors. (See also Chandra, Stockmeyer, and Vishk in [4].)A Q(log log n) Lower B ound for Merging on Valiants Model

Sequentially merging two lists of length n can be accomplished with 2n − 1 comparisons and this is provably optimal. Since only 2n − 1 comparisons are necessary to merge two such lists, conceivably in a parallel model they could be merged in time O(1) with n processors. However, we shall show that this is not possible. Even allowing n log n comparisons per step, a depth of log log n is needed. The previously stated Theorem 2 will follow as an immediate corollary of Lemma 3, which we now motivate. Throughout Section III we use integer floor and integer ceiling only when crucial.

Consider the process of merging two sorted lists a₁,..., aₙ and b₁,..., bₙ with n processors. At the first step at most n comparisons can be made. Partition each list into 2√n blocks of length ½√n. Form pairs of blocks, one from each list. There are 4n such pairs of blocks. Clearly there must be 3n pairs (Aᵢ, Bⱼ) of blocks such that no element from the block Aᵢ is compared with any element from the block Bⱼ. We shall show that we can select √n pairs of blocks

(A_{i₁}, B_{j₁}), (A_{i₂}, B_{j₂}),...

such that i_l < i_{l+1} and j_l < j_{l+1}; following the lemma below, we call such a selection a compatible matching.


LEMMA 2. Let A = A₁, A₂,..., A_{2k}, B = B₁, B₂,..., B_{2k}, and let E ⊆ A × B have 3k² edges. Then the bipartite graph G = (A ∪ B, E) has a compatible matching of cardinality at least k.

Proof. Partition the edges into 2k blocks as follows. For −k < b < k we have a block consisting of the edges {(Aᵢ, B_{i+b}) | 1 ≤ i ≤ 2k and 1 ≤ i + b ≤ 2k}. In addition we have one block consisting of all other edges. At most k² + k edges fall into the block of other edges. Thus at least 2k² − k edges must be partitioned into the 2k − 1 diagonal blocks. Hence at least one block must have at least k edges, and these edges form a compatible matching.

LEMMA 3. Let T(s, c) be the time necessary to solve k, k ≥ 1, merging problems of size s with cks processors. Then T(s, c) is Ω(log(log sc/log c)).

Proof. On the average we can assign cs processors to each problem. At least one half of the problems can have no more than twice this number of processors assigned to them. That is, at least k/2 problems have at most 2cs processors.

Consider applying 2cs processors to a problem of size s. This means that in the first step we can make at most 2cs comparisons. Partition the lists into 2√(2cs) blocks, each of size ½√(s/2c). There are 8cs pairs of blocks. Thus there must be 6cs pairs of blocks with no comparisons between elements of the blocks in a pair. Construct a bipartite graph whose vertices are the blocks from the two lists, with an edge between two blocks if there are no comparisons between elements of the two blocks. Clearly there are 6cs edges and thus by the previous lemma there is a compatible matching of size at least √(2cs). This means that there are at least √(2cs) problems, each of size at least ½√(s/2c), that we must still solve. Thus T(s, c) ≥ 1 + T(½√(s/2c), 4c).

We show by induction on s that

T(s, c) ≥ d log(log sc/log c)

for some sufficiently small d. By the recurrence, and noting that ½√(s/2c) · 4c = √(2cs),

T(s, c) ≥ 1 + d log(log √(2cs)/log 4c)    (w.l.o.g. assume c ≥ 4)
        ≥ 1 + d log(log sc/log c) − d log 4
        ≥ d log(log sc/log c)

provided d ≤ ½. Observe that log(log sc/log c) is Ω(log log s − log log c), which matches Valiant's upper bound of 2(log log s − log log c).


An O(log log n) Upper Bound for Merging on a PRAM

We recall Valiant's (n, m) merging algorithm which merges X and Y with #X = n, #Y = m, n ≤ m, using √(nm) processors. Our goal is to implement Valiant's algorithm on a PRAM. However, we shall require n + m processors rather than √(nm) as in Valiant, because of the data movement which need not be accounted for in the tree model. The algorithm (taken verbatim from Valiant [28]) proceeds as follows:

(a) Mark the elements of X that are subscripted by i · ⌈√n⌉ and those of Y subscripted by i · ⌈√m⌉ for i = 1, 2,.... There are at most ⌊√n⌋ and ⌊√m⌋ of these, respectively. The sublists between successive marked elements and after the last marked element in each list we call segments.

    (b) Compare each marked element of X with each marked element of Y. This requires no more than ⌊√n⌋ · ⌊√m⌋ ≤ √(nm) comparisons and can be done in unit time.

    (c) The comparisons of (b) will decide for each marked element the segment of the other list into which it needs to be inserted. Now compare each marked element with every element of the segment of the other list that has thus been found for it. This requires at most

    ⌊√n⌋(⌈√m⌉ − 1) + ⌊√m⌋(⌈√n⌉ − 1)

comparisons altogether and can also be done in unit time.

    On the completion of (a), (b), and (c) we can store each x_{i⌈√n⌉} in its appropriate place in the output Z. It then remains to merge the disjoint pairs of sublists (X_0, Y_0), (X_1, Y_1), ..., where the X_i and Y_i are subsequences of X and Y, respectively. Whereas Cauchy's inequality guarantees that there will be enough processors to carry out these independent merges by simultaneous recursive calls of the algorithm, it is not clear how to inform each processor to which (X_i, Y_i) subprogram (and in what capacity) it will be assigned. This is the main concern in what Shiloach and Vishkin [23] refer to as the processor allocation problem.
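A sequential rendering of steps (a)–(c) may clarify the recursive structure. The sketch below is our own: function names are hypothetical, and the unit-time all-pairs comparisons of steps (b) and (c) are replaced by binary searches, which a sequential simulation can afford. Each recursion level corresponds to O(1) parallel rounds, and the first list shrinks to roughly its square root at every level, giving the O(log log n) depth:

```python
import bisect
import math

def valiant_merge(X, Y):
    """Sequential sketch of Valiant's doubly logarithmic merge of sorted lists."""
    n, m = len(X), len(Y)
    if n > m:                         # keep the shorter list as X
        return valiant_merge(Y, X)
    if n == 0:
        return list(Y)
    if n == 1:                        # base case: insert the single element
        out = list(Y)
        bisect.insort(out, X[0])
        return out
    step = math.isqrt(n - 1) + 1      # ceil(sqrt(n)) marking interval
    marked = list(range(step - 1, n, step))            # step (a): mark X
    # steps (b)+(c), simulated: locate each marked element of X within Y
    pos = [bisect.bisect_left(Y, X[i]) for i in marked]
    out, px, py = [], 0, 0
    for i, p in zip(marked, pos):     # recurse on the disjoint (X_i, Y_i) pairs
        out.extend(valiant_merge(X[px:i], Y[py:p]))
        out.append(X[i])              # marked element goes to its final place
        px, py = i + 1, p
    out.extend(valiant_merge(X[px:], Y[py:]))
    return out

assert valiant_merge([1, 3, 5], [2, 4, 6]) == [1, 2, 3, 4, 5, 6]
```

Note that the simulation sidesteps the processor allocation problem entirely; on a PRAM, it is exactly the bookkeeping of which processors serve which recursive call that must be made explicit.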

    We desire a recursive procedure (in fact, a macro might be more appropriate) MERGE(i, n_i, j, m_j, k) which merges x_i, x_{i+1}, ..., x_{i+n_i−1} and y_j, ..., y_{j+m_j−1} into z_{i+j−1}, ..., z_{i+j+n_i+m_j−2} using at most n_i + m_j processors beginning at processor number p_k. Such a merge will be simultaneously invoked by processors p_k, p_{k+1}, ..., p_{k+n_i+m_j−1}. The initial call is MERGE(1, n, 1, m, 1). As we enter this subroutine, a processor p_k will know from i, j, n_i, m_j, and k the (relative) role it plays in parts (a), (b), and (c) of Valiant's algorithm. For example, say n_i ≤ m_j and let

    l = k + i_1 ⌈√m_j⌉ + j_1,   0 ≤ …


    TABLE I


    …mined for each i, 0 ≤ i ≤ ⌊√n⌋ − 1, that y_j…


    …O(1) parallel time sorting on a WRAM. Further progress (beyond [9, 25]) on this problem will almost surely have to await a better understanding of the more basic circuit models. The general question of tradeoffs between parallel time and the number of processors (or "hardware" as referred to by Cook and Dymond [7]) is clearly a major theme in complexity theory.

    ACKNOWLEDGMENTS

    We are indebted to U. Vishkin for observing that our original claims about merging on a PRAM could not hold because of the amount of data movement required by our formulation of the problem. We thank P. Beame for his clear proof of Lemma 1, and we also thank R. Prager and L. Rudolph for many helpful comments.

    REFERENCES

    1. R. Aleliunas, Randomized parallel communications, in Proc. ACM Sympos. on Principles of Distributed Computing, Ottawa, Canada, 1982, pp. 60-72.

    2. M. Ajtai, J. Komlós, and E. Szemerédi, An O(n log n) sorting network, in Proc. 15th Annual ACM Sympos. on Theory of Comput., Boston, Mass., 1983, pp. 1-9.

    3. A. Borodin and J. Hopcroft, Routing, merging, and sorting on parallel models of computation, in Proc. 14th Annual ACM Sympos. on Theory of Comput., San Francisco, Calif., 1982, pp. 338-344.

    4. A. K. Chandra, L. J. Stockmeyer, and U. Vishkin, Complexity theory for unbounded fan-in parallelism, in Proc. 23rd Annual IEEE Sympos. on Found. of Comput. Sci., 1982, pp. 1-13.

    5. S. A. Cook, Towards a complexity theory of synchronous parallel computation, Enseign. Math. (2) 27 (1981), 99-124.

    6. S. A. Cook and C. Dwork, Bounds on the time of parallel RAMs to compute simple functions, in Proc. 14th Annual ACM Sympos. on Theory of Comput., 1982, pp. 231-233.

    7. S. A. Cook and P. Dymond, Hardware complexity and parallel computation, in Proc. 21st Annual IEEE Sympos. on Found. of Comput. Sci., 1980, pp. 360-372.

    8. S. Fortune and J. Wyllie, Parallelism in random access machines, in Proc. 10th Annual ACM Sympos. on Theory of Comput., San Diego, Calif., 1978, pp. 114-118.

    9. M. Furst, J. B. Saxe, and M. Sipser, Parity, circuits and the polynomial-time hierarchy, in Proc. 22nd Annual IEEE Sympos. on Found. of Comput. Sci., 1981, pp. 260-270.

    10. Z. Galil and W. J. Paul, An efficient general purpose parallel computer, in Proc. 13th Annual ACM Sympos. on Theory of Comput., Milwaukee, Wisc., 1981, pp. 247-256.

    11. L. Goldschlager, A unified approach to models of synchronous parallel machines, in Proc. 10th Annual ACM Sympos. on Theory of Comput., San Diego, Calif., 1978, pp. 89-94.

    12. A. Gottlieb, B. D. Lubachevsky, and L. Rudolph, Coordinating large numbers of processors, in Proc. Int. Conf. on Parallel Processing, 1981, pp. 341-349.

    13. R. Häggkvist and P. Hell, Parallel sorting with constant time for comparisons, SIAM J. Comput. 10 (3) (1981), 465-472.

    14. R. Häggkvist and P. Hell, Sorting and Merging in Rounds, Computing Science technical report TR81-9, Simon Fraser University, Burnaby, B.C., Canada, 1981.

    15. D. E. Knuth, "The Art of Computer Programming, Vol. 3: Sorting and Searching," Addison-Wesley, Reading, Mass., 1972.

    16. C. P. Kruskal, Searching, merging, and sorting in parallel computation, IEEE Trans. Comput., 1983, in press.

    17. T. Lang, Interconnections between PEs and MBs using the shuffle-exchange, IEEE Trans. Comput. C-25 (1976), 496.

    18. G. Lev, N. Pippenger, and L. G. Valiant, A fast parallel algorithm for routing in permutation networks, IEEE Trans. Comput. C-30 (1981), 93-100.

    19. F. P. Preparata, New parallel-sorting schemes, IEEE Trans. Comput. C-27 (7) (1978), 669-673.

    20. F. P. Preparata and J. Vuillemin, The cube-connected cycles, in Proc. 20th Annual IEEE Sympos. on Found. of Comput. Sci., 1979, pp. 140-147.

    21. J. Reif and L. G. Valiant, A Logarithmic Time Sort for Linear Size Networks, Aiken Computation Laboratory, TR-13-82, Harvard University, 1982.

    22. J. T. Schwartz, Ultracomputers, ACM Trans. Programming Lang. Systems 2 (1980), 484-521.

    23. Y. Shiloach and U. Vishkin, Finding the maximum, merging, and sorting in a parallel computation model, J. Algorithms 2 (1) (1981), 88-102.

    24. M. Snir, On Parallel Searching, Department of Computer Science technical report TR-045, Courant Institute, New York University, June 1982.

    25. L. J. Stockmeyer and U. Vishkin, Simulation of Parallel Random Access Machines by Circuits, RC-9362, IBM T. J. Watson Research Center, Yorktown Heights, N.Y., 1982.

    26. H. Stone, Parallel processing with the perfect shuffle, IEEE Trans. Comput. C-20 (2) (1971), 153-161.

    27. E. Upfal, Efficient schemes for parallel communication, in Proc. ACM Sympos. on Principles of Distributed Computing, Ottawa, Canada, 1982.

    28. L. G. Valiant, Parallelism in comparison problems, SIAM J. Comput. 4 (1975), 348-355.

    29. L. G. Valiant and G. J. Brebner, Universal schemes for parallel computation, in Proc. 13th Annual ACM Sympos. on Theory of Comput., Milwaukee, Wisc., 1981, pp. 263-277.

    30. U. Vishkin, A Parallel-Design Space Distributed-Implementation Space (PDDI) General Purpose Computer, RC-9541, IBM T. J. Watson Research Center, Yorktown Heights, N.Y., 1983.