
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-25, NO. 12, DECEMBER 1976

Multiprocessing Systems

JEAN-LOUP BAER, MEMBER, IEEE

Abstract-This paper surveys the state of the art in the design and evaluation of multiprocessing systems. Multiprocessor architectures of the SIMD and MIMD type are reviewed and further classified depending on their tight or loose coupling and their homogeneity. The additional complexity of the software for control, synchronization, efficient utilization, and performance monitoring of multiple processors is emphasized.

Index Terms-Array processors, control, multiprocessors, pipeline computers, synchronization of concurrent processes, tight and loose coupling.

I. INTRODUCTION

THERE are two definitions that have been used with respect to multiprocessing systems. The ANSI Vocabulary of Information Processing defines a multiprocessor, broadly, as a system composed of two (or more) processing units under integrated control. A more restrictive attitude is to define multiprocessing as the simultaneous processing of two (or more) portions of the same program by two (or more) processing units. In the following, processing units will be arithmetic-logical units and we do not consider the case of multicomputers, that is, computer systems linked through either secondary memory or an input/output device. This also excludes networks of computers from the scope of this paper.

Every major manufacturer offers a multiprocessing system which can fit the broad definition given above. A comprehensive tabulation can be found in [9]. On the other hand, the number of multiprocessors adhering strictly to the second definition is quite limited. Most of these multiprocessors are either intended for special-purpose applications (e.g., array processors) or their processing units are tightly coupled and consist of functional units which perform dedicated tasks (e.g., CDC 6600, CDC 7600, and IBM 360/91). With the advent of cheaper hardware and hence cheaper processing units, the trend to distribute tasks over several components of the system is a natural one. The next section of this paper will give a brief review of possible multiprocessor organizations.

Major problems in the efficient utilization of multiprocessors reside in the design of the software. In the general-purpose multi-arithmetic-logical-unit configuration the difficulty is mainly in the implementation of an integrated control within the operating system. For example, synchronization, task splitting, and scheduling are areas where the presence of more than one processing unit increases the supervisor's complexity. Performance monitoring and evaluation is also more complex than in a single-machine environment. In the more special-purpose systems, basic (e.g., compilers) and application software have to be tailored to the specific machine realization. Section III will discuss some of the problems arising in these situations.

Manuscript received April 12, 1976; revised June 21, 1976. This work was supported by the National Science Foundation under Grant GJ-41164. The author is with the Department of Computer Science, University of Washington, Seattle, WA 98195.

In this survey we shall reference only seminal and/or current papers. Extensive bibliographies can be found in [2], [15].

II. MULTIPROCESSING SYSTEMS ARCHITECTURE

In a widely referenced paper, Flynn [11] has classified computer systems in four categories.

1) Single-Instruction Single Data (SISD): These are the usual uniprocessor computers.

2) Single-Instruction Multiple Data (SIMD): Further categorization is needed and will be provided in this section.

3) Multiple-Instruction Single Data (MISD): To our knowledge, there exists no system which can be labeled uniquely this way. The pipeline, or vector, processors discussed below could be included in this category.

4) Multiple-Instruction Multiple Data (MIMD): This general organization encompasses several subdivisions which will be reviewed shortly.

In SIMD architectures, a single control unit fetches and decodes instructions. The instruction is then executed either in the control unit itself (e.g., a jump instruction) or it is broadcast to some processing elements. These processing elements operate synchronously, but their local memories have different contents. Depending on the complexity of the control unit, the processing power and addressing method of the processing elements, and their interconnection facilities, one can distinguish between array processors, processing ensembles, and associative processors. In array processors the control unit has limited capabilities and elements communicate with their neighbors (for a good discussion of the possible interconnection schemes and their complexity, see [21]). In processing ensembles the control unit is, in general, a full-fledged computer, and in order to communicate the processing elements have to pass their messages through the control unit. In general, each element will operate on an associative memory. This is mandatory in the case of associative processors, for which the associative memories are larger and interconnections are more extensive.


Fig. 1. Illiac IV. (Control unit, common data bus carrying addresses and common operands, processing elements, and routing network.)

Because the concept of associative processing requires a large amount of hardware (logic at each bit), it has not yet been proven cost-effective except for very specific subfunctions within a complex central processing unit (e.g., virtual memory mapping). Also, since the architecture implied in that mode of operation is quite different from that of other multiprocessing systems, we will not discuss it further and refer the reader to an excellent recent survey [22].

The paradigm of an array processor is Illiac IV (cf. Fig. 1). The control unit does not have any arithmetic capabilities except for indexing, and its memory is limited to an instruction stack for lookahead purposes. It controls 64 processing elements, each one being a complete arithmetic-logical unit to which is associated a 2K 64-bit/word memory. The intercommunication facility can be visualized as an 8 × 8 ring structure, i.e., each processing element can communicate with 4 neighbors (element i can route in one instruction the contents of one of its internal registers to any of elements i + 1, i - 1, i + 8, or i - 8). Because of the constraints on the synchronous operation of the processing elements and the limited capability of the control unit, the operating system resides in a satellite computer (initially a Burroughs B6700, now a PDP-10) which controls program loading and input/output operations, and which also provides the basic software (assemblers, compilers) for the array processor.

The raison d'etre of Illiac IV is in the enormous amount of computing power that results from the parallel activation of 64 processors. But the range of applications is severely limited by the synchronous operation of these processors. Some often cited examples of cost-effective use of array processors such as Illiac IV are weather prediction (and in fact any solution of systems of partial differential equations or grid problems), matrix manipulation and solution of linear systems of equations (e.g., in nuclear energy applications), and air traffic control. The use of array processors for data base management (possibility of parallel queries) and data processing (concurrent processing of records with identical structure) has also been advocated.
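As a concrete illustration of the routing constraint just described, the following short Python sketch (a hypothetical helper, not Illiac IV software) computes the four elements reachable from a given processing element in one routing instruction, with indices wrapping modulo 64 as in the 8 × 8 ring:

    # Sketch: Illiac IV one-step routing. The 64 processing elements form an
    # 8 x 8 array closed into a ring, so element i can send a register to
    # elements i+1, i-1, i+8, and i-8 (indices taken modulo 64).
    def neighbors(i, n=64):
        return [(i + d) % n for d in (1, -1, 8, -8)]

    print(neighbors(0))    # [1, 63, 8, 56]
    print(neighbors(37))   # [38, 36, 45, 29]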

The pipeline processors can be considered either SIMD, MISD, or MIMD architectures. Pipeline processing is analogous to an assembly-line organization. The computational power is segmented into consecutive stations. Processes are decomposed into subprocesses which have to pass through each station or stage. All stages must have the same processing time t, and this synchronization has to be enforced through worst-case analysis and appropriate latches. If there are p stages then one operation will take pt, but k consecutive operations will take pt + (k - 1)t, delivering a result at every interval t.
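A small numerical illustration of this timing relation, in Python; the stage count and stage time below are assumed values for illustration, not those of any particular machine:

    # k operations through a p-stage pipeline with stage time t finish in
    # pt + (k - 1)t: the first result appears after pt, then one result every t.
    def pipelined_time(p, t, k):
        return p * t + (k - 1) * t

    def serial_time(p, t, k):          # same work with no overlap between operations
        return k * p * t

    p, t, k = 4, 60e-9, 1000           # assumed: 4 stages, 60 ns per stage, 1000 operations
    print(pipelined_time(p, t, k))     # ~6.02e-05 s: one result every 60 ns in steady state
    print(serial_time(p, t, k))        # 2.4e-04 s: roughly p times slower for large k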

Pipelining, or overlap, has existed in computer systems for a long time, e.g., at the instruction fetch (lookahead) level. Pipelined functional units have appeared in the IBM 360/91 with a two-stage floating-point adder and a multiply unit where the pipelining is enforced in the iterative hardware. Other recent realizations are discussed in [5].

Two of the most powerful existing computers, the CDC STAR-100 and the TI Advanced Scientific Computer (ASC) (cf. Fig. 2), are based on a general pipeline design. Their instruction set is heavily oriented towards vector operations. In both systems, several pipeline units can function in parallel. In the ASC they are identical and hence not all mandatory, while in the STAR-100 they perform specific functions.

Fig. 2. Texas ASC computer. (Main memory (160 ns), memory buffers (MB), and arithmetic units (AU).)

Returning to our comment on the difficulty of classifying pipeline computers, we can either call them SIMD if we consider that a single instruction, e.g., a two-stage floating-point add, treats simultaneously (two) different items of data; or MISD if we consider instructions in consecutive stages to be different but working on the same aggregate (vector) of data; or MIMD by combining the above two interpretations.

The design and effective utilization of pipeline processors present interesting challenges. Chen [5] and Ramamoorthy and Li [18] have written authoritative papers on the topic. It appears that the concept is quite powerful and it can be applied equally well to some small units such as floating-point adders and to large ones such as complete central processing units sharing a common memory or networks of microcomputers.

We now consider MIMD architectures. Two features are of interest to differentiate among designs: the coupling or switching of processing units and memories, and the homogeneity of the processing units.

In tightly coupled multiprocessors, the number of processing units is fixed, and they operate under the supervision of a strict control scheme. Generally, the controller is a hardware unit. (Note that this tight coupling is also present in array processors and pipeline computers.) Most of the hardware-controlled tightly coupled multiprocessors are heterogeneous in the sense that they consist of specialized functional units (e.g., adders, multipliers) supervised by an instruction unit which decodes instructions, fetches operands, and dispatches orders to the functional units. Typical examples of these systems are the CDC 6600, CDC 7600, and IBM 360/91, with the CDC STAR-100 and TI's ASC also being in this category. When the controlling scheme is provided by the operating system, as in IBM's multisystem 360's and 370's or UNIVAC's 1108 and 1110, the processors will generally perform unrelated tasks and the "multiprocessing" is of the type implied in the broad definition given in the introduction.

The trend towards tight coupling is apparent only in supercomputers. (The CRAY-1 is reportedly of that type, as is Flynn's proposal for a shared-resource multiprocessor [12].) In recent homogeneous multiprocessing systems and distributed-function architectures one sees a looser and more modular type of connection being realized. Three main types of organizations are possible for the processor-memory switching apparatus.

1) The Crossbar Switch: This is the most extensive, and expensive, scheme, providing direct paths from processors to memories (cf. Fig. 3). If there are m processors and n memory modules, i.e., a concurrency of min (m,n), the crossbar requires m × n switches; i.e., if m = n the number of crosspoints grows as n². Since each crosspoint must have hardware capable of switching parallel transmissions and of resolving conflicting requests for a given memory module, the switching device rapidly becomes the dominant factor in the cost of the overall system. Nevertheless, such a matrix has been proposed for C.mmp [24], a multiprocessor consisting of 16 slightly modified PDP-11's connected to 16 memory modules (at the present time only a 5 × 5 system is operational). In addition to the shared memory, each processor has a private store. Communication between processors is through another type of switching network, a multiplexed time-shared bus whose structure is similar to the one described now.
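The cost and arbitration issues can be made concrete with a minimal Python sketch; this is an abstract model, not C.mmp's actual switch, and the lowest-index priority rule is only a stand-in for whatever rule the crosspoint hardware implements:

    # An m x n crossbar model: each cycle every processor may request one memory
    # module, and each module grants at most one request, so the achievable
    # concurrency is bounded by min(m, n) while the hardware cost grows as m * n.
    def crossbar_cycle(requests, n_modules):
        """requests[p] is the module wanted by processor p, or None.
        Returns a dict module -> granted processor (lowest index wins)."""
        grants = {}
        for p, m in enumerate(requests):
            if m is not None and m not in grants:
                grants[m] = p
        return grants

    m, n = 16, 16                          # a C.mmp-scale configuration
    print("crosspoints:", m * n)           # 256; grows as n^2 when m = n
    print(crossbar_cycle([0, 0, 3, None, 3, 7], n))  # conflicts on modules 0 and 3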

2) Time-Shared Buses: There are several degrees of complexity in this organization depending on the number of buses. The simplest one is to have all processors, memories, and input/output units connected to a single bus (cf. Fig. 4) which can be totally passive. Then of course there is no possibility of concurrent transactions. The same principle can be applied to (virtual) processors sharing a common memory (and even some circuitry) in a synchronous multiplexed fashion. This is the architecture chosen for the peripheral processors of the CDC 6600 and, more recently, the virtual processors of the TI ASC.

To attain more parallelism, at the price of more complexity, one can have several buses, either uni- or multidirectional. Priorities can be given to specific units if one adds a bus arbiter to resolve them.

If one wants to connect a large number of processors (e.g., microprocessors) and small memory modules, the time-shared multibus techniques seem most appropriate. Arden and Berenbaum [1] have proposed a switching logic based on a binary search tree for the memory network and a single bus with an arbiter for connection between processors. The bus arbiter scheme has also been chosen for the Minerva multimicroprocessor [23], with a single bus for transmission between devices but with each device having a direct connection to the arbiter.
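A sketch of the arbiter idea follows; the round-robin policy is an assumption for illustration and is not claimed to be the rule used in Minerva or in the Arden-Berenbaum proposal:

    # A single time-shared bus with an arbiter: one requesting device is granted
    # the bus per cycle; rotating the priority order prevents starvation.
    from collections import deque

    class BusArbiter:
        def __init__(self, n_devices):
            self.order = deque(range(n_devices))

        def grant(self, requesting):
            for _ in range(len(self.order)):
                dev = self.order[0]
                self.order.rotate(-1)      # device just considered moves to the back
                if dev in requesting:
                    return dev
            return None                    # no device asked for the bus this cycle

    arb = BusArbiter(4)
    print(arb.grant({1, 3}))   # 1
    print(arb.grant({1, 3}))   # 3: rotation gives the other requester its turn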

3) Multiport Memory Systems: In this organization the switching is concentrated in the memory module. Each processor has access through its own bus to all memory modules, and conflicts are in general resolved through hardwired fixed priorities.

Evidently, one can mix some of these strategies. For example, Fig. 5 shows an abstraction of a multiprocessor used as a modular switching node for the ARPA network [14] (the input/output buses are not shown). We can see that we have in essence seven dual processors sharing two banks of two memory modules in an organization reminiscent of a multiport structure.

Fig. 3. Crossbar switch.

Looking back at the evolution of multiprocessing systems, one can see a general trend. When the goal of the system is raw computing power, the architecture will be of the tightly coupled type, either SIMD for more specialized applications or MIMD with custom-designed functional units, or a mixture of both such as in pipeline processors.

The coupling is still tight, but not as much, when a homogeneous multiprocessor is designed with a general application in mind but with many variations being possible. This is the case for C.mmp, intended primarily for a real-time speech understanding system, and for the multiprocessor used as a switching node in the ARPA network. When the system is going to consist of numerous small modules and when it is intended for general-purpose applications, the connections are necessarily of the loose type. In this latter case, the challenge facing computer architects is one of organization. Processors should be able to access memories with as much concurrency as possible; they should communicate with each other in a generalized way; but one should also be able to configure them dynamically in a specialized way. Investigations of these distributed-function systems and reconfigurable computers are in process, and further departures from the von Neumann architecture will be seen shortly since the cost of processors and memories is decreasing drastically.

III. SOFTWARE REQUIREMENTS

As mentioned earlier, multiprocessing systems add levels of complexity for both the operating system and the application software.

When array processors and pipeline computers are considered, the SIMD structure imposes some constraints that make a direct transplantation of conventional (multiprogrammed) operating systems extremely inefficient. For this reason either a satellite computer, as in the case of Illiac IV, or virtual (peripheral) processors, as in the TI ASC, are almost mandatory. Programming languages should be modified both in their control and their data structures. For example, in the array processor case, to take advantage of the synchronous operations of processing elements, statements like DO PARALLEL, IF ALL, IF ANY, etc., have been advocated. But it appears that the difficulties in producing efficient compiled code for languages with such extensions have slowed their development, and one current trend is to let the applications programmer bear the burden of designing his algorithm with the implied knowledge of the machine architecture, while providing an efficient means of distributing his data in the elements' memories [17]. In the case of pipeline processors, vector operations have been added to algorithmic languages such as Algol-60 or Fortran, but it is somewhat surprising that a vector language such as APL has not been more widely used.
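To suggest what such statements imply, here is a small Python sketch of lock-step, masked execution over the elements' data; the names do_parallel, if_any, and if_all are hypothetical stand-ins, not constructs of any actual Illiac IV language:

    # Each "processing element" holds one slot of the lists below; a DO PARALLEL-style
    # statement applies an operation to every slot at once, and IF ANY / IF ALL test
    # a condition across all elements, with non-selected elements left unchanged.
    def do_parallel(op, *vectors):
        return [op(*vals) for vals in zip(*vectors)]

    def if_any(mask):
        return any(mask)

    def if_all(mask):
        return all(mask)

    a = [1.0, -2.0, 3.0, -4.0]
    b = [5.0, 6.0, 7.0, 8.0]
    mask = do_parallel(lambda x: x > 0, a)       # per-element test
    if if_any(mask):
        # masked update: elements failing the test keep their old value
        a = do_parallel(lambda x, y, m: x + y if m else x, a, b, mask)
    print(a)   # [6.0, -2.0, 10.0, -4.0]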

In the MIMD architecture, control, synchronization, and scheduling of the processors are sensitive areas. Two main types of control can be envisioned. In the fixed mode, one or more processors are dedicated to executing the operating system. When some other processor terminates its current tasks, or when all processors are busy and a higher priority task has to be initiated, it is the responsibility of the dedicated processor(s) to schedule, terminate, and/or initiate processes. An advantage of such a scheme is that special-purpose hardware, e.g., associative memory, can be embedded in the design, hence decreasing the executive's overhead. The main disadvantage is that failure in the dedicated processor brings the whole system to a halt, unless there are several identical controlling processors (as in the TI ASC, to some extent). In the floating control mode, each processor can have access to the operating system and schedule itself. For reliability, graceful degradation, and protection between users this is certainly a better approach [10]. The overhead in memory contention and synchronization could appear to be higher than that of the fixed control scheme, but there is less centralized message passing. Gonzalez and Ramamoorthy have presented an interesting case study [13] leading them to conclude on the superiority of decentralized systems. However, they have not considered the introduction of specialized hardware, and a general comparison remains an area of further study.

The synchronization of concurrent processes has been a topic of intense investigation since Dijkstra's famous letter [6]. Numerous solutions for the harmonious cooperation of sequential tasks addressing common variables have appeared. They can be classified according to the following.

1) The use of special primitive operations such as Dijkstra's semaphores [7].

2) The priority scheme given to the processes.

3) The fairness of scheduling, i.e., what is the maximum amount of time that a process has to wait before entering its "critical section" (i.e., the portion of code where it should find a stable environment and be the only one able to modify it).

4) The graceful degradation of the system if one of its components (processors) fails.

Fig. 4. Time-shared bus. Fig. 5. Minimultiprocessor connection scheme. (BA: bus arbiter; m: small memory module; M: large memory module.)

Solutions using semaphores or their extensions have appeared frequently in the literature. An interesting approach using a polling scheme, with no semaphores, presenting a first-in-first-out fair schedule and graceful degradation has been given by Lamport [16]. One can compare these two philosophies when applied to a specific problem, the concurrency of list processing and garbage collection, by consulting two recent papers [8], [19]. To this author, the solution without semaphores is certainly more pleasing, but its efficiency remains to be determined.
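Lamport's scheme [16] is the "bakery" algorithm: each process takes a numbered ticket and waits its turn by polling shared variables, with no semaphore primitive. The Python sketch below is illustrative only (busy-waiting, and it relies on the informal assumption that the list reads and writes behave as atomic shared-memory accesses):

    import threading

    N = 3                        # number of competing processes
    choosing = [False] * N       # True while process i is picking a ticket
    number = [0] * N             # ticket numbers; 0 means "not competing"

    def lock(i):
        choosing[i] = True
        number[i] = 1 + max(number)              # take a ticket larger than any seen
        choosing[i] = False
        for j in range(N):
            if j == i:
                continue
            while choosing[j]:                   # wait until j has finished choosing
                pass
            # lower ticket goes first (ties broken by index): first-in-first-out fairness
            while number[j] != 0 and (number[j], j) < (number[i], i):
                pass

    def unlock(i):
        number[i] = 0

    counter = 0
    def worker(i):
        global counter
        for _ in range(1000):
            lock(i)
            counter += 1                         # critical section
            unlock(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)                               # 3000 if mutual exclusion held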

If one wants to take advantage of the multiprocessor structure within a single program, i.e., realize the strict definition of multiprocessing, the conventional program structure has to be modified. Three non-mutually exclusive options are open: design algorithms such that they take into account the parallel architecture (e.g., this is mandatory for array processors); express potential parallelism in the high-level and assembly languages; and detect the parallelism automatically.

When designing new algorithms for parallel environments, e.g., on a multiprocessor with n processors operating concurrently, an ideal situation would be to perform n times "better" than a uniprocessor. This word "better" has to be qualified and we have to define measures of performance. Two metrics are possible: the completion time of a given program and the throughput of the whole system when the multiprocessor is also multiprogrammed. For these criteria, the ideal situation is not generally attainable because, in particular:

1) In a given program, the amount of parallelism is not uniform and it is seldom that the n processors can be kept consistently busy.

2) The n processors share resources, such as main memory, and the contention will, in most cases, degrade performance.

3) The transposition of serial algorithms to parallel ones does not necessarily yield a theoretical speedup of n.

An interesting case for the latter argument is the (in-core) sorting problem. It is well known that a lower bound on the maximum number of steps (comparisons and exchanges) to sort N numbers is of the order of N log2 N, i.e., O(N log N). Assume now that we are in a multiprocessing environment where we are allowed to make several comparisons and exchanges in parallel, but each one has to be


handled by a separate processor, i.e., we have only two-way comparisons. Hence, the maximum number of useful processors at a given time is N/2. Stone [20], following on Batcher's work [4], has shown that with N/2 processors interconnected through a "perfect shuffle" network (i.e., one which allows processor i to route its result to processor 2i if i < N/2 and to processor 2i + 1 - N otherwise), the number of comparisons and exchanges is now O(N(log N)²) with no idle processor. That is, while on a serial processor the execution time is O(N log N), on N/2 processors the execution time becomes O((log N)²), instead of the intuitively wrong guess of O(log N).
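A short sketch of the perfect shuffle permutation itself (the comparison-exchange stages are omitted); this follows the usual definition and is not code from Stone's paper:

    # Perfect shuffle on N processors (N a power of two): processor i routes its
    # result to 2i if i < N/2 and to 2i + 1 - N otherwise, i.e., a cyclic left
    # shift of the bits of i; after log2(N) shuffles every index returns home.
    def shuffle_destination(i, N):
        return 2 * i if i < N // 2 else 2 * i + 1 - N

    N = 8
    print([shuffle_destination(i, N) for i in range(N)])   # [0, 2, 4, 6, 1, 3, 5, 7]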

Expressing concurrency of paths in the code can be done via FORK-JOIN at the assembly language level or PARBEGIN-PAREND in the high-level language environment. Of course the difficulty resides in the mutual exclusion of critical sections, and we have already mentioned the possible alternatives.
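A minimal sketch of the idea with threads (Python threads standing in for processors; the helper name parbegin_parend is hypothetical):

    import threading

    def parbegin_parend(*branches):
        """FORK each branch into its own thread, then JOIN: control does not
        continue until every branch between parbegin and parend has completed."""
        threads = [threading.Thread(target=b) for b in branches]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    results = {}
    parbegin_parend(
        lambda: results.update(a=sum(range(1000))),   # branch 1
        lambda: results.update(b=max(range(1000))),   # branch 2
    )
    print(results)   # both branches have finished before this line runs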

Detecting parallelism automatically, that is, during compilation, is a process which has received considerable attention during the last decade. Theoretical and practical studies have been done, mostly on the parallelism at the instruction level [2], [15]. Kuck and his coworkers [15] have shown that by using this static approach, even on small ordinary Fortran programs, many functional units could be kept busy. They advance a figure of 16 as being a reasonable target. Although their automatic analyzer is quite complex, the scheduler is very simple and more could be gained by looking into that area. Also, it is worthwhile to notice that memory contention and conflicts in routing data between processors may considerably slow down the process. Finally, another disadvantage is that the compiler itself, which performs all the detection of parallelism (a nontrivial task), is a sequential process on which automated detection of parallelism fails miserably. Studies to tailor the compilation process to an MIMD environment are in progress [3].
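One simple criterion such an analyzer can apply, sketched below, is a read/write-set independence test; this is a generic illustration, not the specific analysis used by Kuck's group:

    # Two statements may run in parallel if neither writes a variable that the
    # other reads or writes.
    def independent(reads1, writes1, reads2, writes2):
        r1, w1, r2, w2 = map(set, (reads1, writes1, reads2, writes2))
        return not (w1 & (r2 | w2)) and not (w2 & r1)

    # S1: a = b + c   S2: d = e * f   -> independent, may execute concurrently
    print(independent({"b", "c"}, {"a"}, {"e", "f"}, {"d"}))   # True
    # S1: a = b + c   S3: b = a - 1   -> dependences on a and b, must stay serial
    print(independent({"b", "c"}, {"a"}, {"a"}, {"b"}))        # False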

This is not, by any means, an exhaustive list of all the sensitive areas in the design of software for multiprocessing systems. The range and magnitude of the problems should convince the reader that substantial efforts are needed before attaining viable, correct, and safe operating systems as well as code efficiently tuned towards parallel processing. The effects of the control and synchronization of concurrent processes on the actual performance of MIMD machines are not yet well understood. Methods to evaluate the performance of multiprocessors closely resemble those used for multiprogrammed systems, and fall outside the scope of this paper. However, even more than in the single-processor environment, one would like to see the designers include a monitoring unit to assess on-line the gains incurred in the parallel environment.

IV. CONCLUSION

In this short survey we have briefly reviewed the architecture of multiprocessor systems, the inherent additional complexity in the development of their software systems, and means to assess their performance.

It has now become a cliche to write about the decreasing cost of hardware due to large-scale integration and to predict a proliferation of distributed-function systems based on cheap (micro) processors. The era of the "no-cost incremental processor" has reached us. But the increased costs and complexity in communication, the overhead in the operating system, and the difficulties in organization for both static configurations and dynamic reconfigurations are factors which should at the same time be made quite clear. Because of low-cost hardware there exists a tendency to predict a large variety of distributed-function multiprocessor architectures in the near future. To this author, the current challenging problems in the design of coherent architectures, of viable and efficient operating and data base systems, and in the inclusion of evaluation tools both during the design process and in the completed system itself will restrict for some time the range of useful systems. Although the trend to distribute processing and to depart further from the von Neumann concept of a monolithic stored program computer is irreversible, realizations might not be so close at hand as the current hardware technology permits.

REFERENCES

[1] B. W. Arden and A. D. Berenbaum, "A multimicroprocessor computer system architecture," in Proc. 5th Symp. on Operating Systems Principles, pp. 114-121, 1975.

[2] J. L. Baer, "A survey of some theoretical aspects of multiprocessing," Comput. Surveys, vol. 5, pp. 31-80, Mar. 1973.

[3] J. L. Baer and C. Ellis, "Compilation in distributed function systems," COMPCON Dig., pp. 31-34, 1976.

[4] K. E. Batcher, "Sorting networks and their applications," in 1968 Spring Joint Computer Conf., AFIPS Proc., vol. 32. Washington, DC: Thompson, pp. 307-314, 1968.

[5] T. C. Chen, "Overlap and pipeline processing," in Introduction to Computer Architecture, H. Stone, Ed. SRA, pp. 375-431, 1975.

[6] E. W. Dijkstra, "Solution of a problem in concurrent programming control," Commun. Assoc. Comput. Mach., vol. 8, pp. 569-570, Sept. 1965.

[7] E. W. Dijkstra, "Cooperating sequential processes," in Programming Languages, F. Genuys, Ed. New York: Academic, pp. 43-112, 1968.

[8] E. W. Dijkstra, L. Lamport, A. J. Martin, C. S. Scholten, and E. F. Steffens, "On-the-fly garbage collection: An exercise in cooperation," Commun. Assoc. Comput. Mach., to be published.

[9] P. H. Enslow, Ed., Multiprocessors and Parallel Processing. New York: Wiley-Interscience, 1974.

[10] R. S. Fabry, "Dynamic verification of operating system decisions," Commun. Assoc. Comput. Mach., vol. 16, pp. 659-668, Nov. 1973.

[11] M. J. Flynn, "Very high-speed computing systems," Proc. IEEE, vol. 54, pp. 1901-1909, Dec. 1966.

[12] M. J. Flynn and A. Podvin, "An unconventional computer architecture: Shared resource multiprocessing," Computer, vol. 5, pp. 20-28, Mar. 1972.

[13] M. J. Gonzalez and C. V. Ramamoorthy, "Parallel task execution in a decentralized system," IEEE Trans. Comput., vol. C-21, pp. 1310-1322, Dec. 1972.

[14] F. E. Heart, S. M. Ornstein, W. R. Crowther, and W. B. Barker, "A new minicomputer/multiprocessor for the ARPA network," in 1973 Nat. Comput. Conf., AFIPS Proc., vol. 42. Montvale, NJ: AFIPS Press, pp. 529-537, 1973.

[15] D. J. Kuck, "Parallel processor architecture-A survey," in Proc. 1975 Sagamore Comput. Conf. on Parallel Processing, pp. 15-39, 1975.

[16] L. Lamport, "A new solution of Dijkstra's concurrent programming problem," Commun. Assoc. Comput. Mach., vol. 17, pp. 453-454, Aug. 1974.

[17] D. H. Lawrie, T. Layman, D. Baer, and J. M. Randall, "Glypnir-A programming language for Illiac IV," Commun. Assoc. Comput. Mach., vol. 18, pp. 157-164, Mar. 1975.

[18] C. V. Ramamoorthy and H. F. Li, "Pipelined processors-A survey," in Proc. 1975 Sagamore Comput. Conf. on Parallel Processing, pp. 40-62, 1975.

[19] G. Steele, "Multiprocessing compactifying garbage collection," Commun. Assoc. Comput. Mach., vol. 18, pp. 495-508, Sept. 1975.

[20] H. S. Stone, "Parallel processing with the perfect shuffle," IEEE Trans. Comput., vol. C-20, pp. 153-161, Feb. 1971.

[21] H. S. Stone, "Parallel computers," in Introduction to Computer Architecture, H. S. Stone, Ed. SRA, pp. 318-374, 1975.

[22] K. J. Thurber and L. D. Wald, "Associative and parallel processors," Comput. Surveys, vol. 7, pp. 215-255, Dec. 1975.

[23] L. C. Widdoes, "The Minerva multimicroprocessor," in Proc. 3rd Symp. on Computer Architecture, pp. 34-39, 1976.

[24] W. A. Wulf and C. G. Bell, "C.mmp-A multimini processor," in 1972 Fall Joint Computer Conf., AFIPS Proc., vol. 41. Montvale, NJ: AFIPS Press, pp. 765-777, 1972.

Jean-Loup Baer (S'66-M'69) received the Diplome d'Ingenieur in electrical engineering and the Doctorate 3e Cycle in computer science from the University of Grenoble, Grenoble, France, in 1961 and 1963, respectively, and the Ph.D. degree from the University of California, Los Angeles, in 1968.

From 1961 to 1963 he was a Research Engineer with the Laboratoire de Calcul of the University of Grenoble. From 1965 to 1969 he was a member of the Digital Technology Group at U.C.L.A. Since 1969 he has been on the faculty of the University of Washington, Seattle, where he is currently an Associate Professor. His present interests are in parallel processing, the management of memory hierarchies, and scheduling theory.

Dr. Baer is a member of the Association for Computing Machinery and the Association Francaise de Cybernetique Economique et Technique. He has served as an IEEE Computer Society Distinguished Visitor during 1973-1975 and is an Associate Editor of the Journal of Computer Languages.

A Survey of Some Recent Contributions to Computer Arithmetic

HARVEY L. GARNER, FELLOW, IEEE

Abstract-This paper surveys some recent contributions to computer arithmetic. The survey includes floating-point arithmetic, nonstandard number systems, and the generation of elementary functions. The design objective for these areas is computation that is both accurate and fast.

Floating-point arithmetic is universally used but is basically inaccurate because of errors due to round off. Experience with short floating-point words using hexadecimal notation motivated a careful study of the error behavior of floating-point number systems with different bases and round-off procedures. Of particular importance is the development of an axiomatic theory of rounding which provides guidance for the design of floating-point hardware.

Nonstandard number systems are frequently suggested as means of obtaining computation speed or avoiding round off. An interesting recent example is a truncated Hensel code representing a finite set of rational numbers.

Research to obtain fast division algorithms has followed the discovery of SRT division in 1958. More recently, techniques have been developed for the efficient generation of other elementary functions.

Computer arithmetic continues to be an active field of research which has been stimulated by the hardware capabilities provided by large-scale integration (LSI).

Index Terms-Computer arithmetic, computer division, floating point, generation of elementary functions, number systems, rounding.

Manuscript received May 7, 1976; revised July 15, 1976.

The author is with the Moore School of Electrical Engineering, University of Pennsylvania, Philadelphia, PA 19174, on leave at the Digital Systems Laboratory, Department of Electrical Engineering, Stanford University, Stanford, CA 94305.

INTRODUCTION

COMPUTER arithmetic continues to be an active research field. Research activities are motivated primarily by the computational cost and performance requirements and by the availability of large-scale integration (LSI), which permits the reliable implementation of large numbers of basic components. Major contributions during the past decade have been made in many areas of computer arithmetic. Among these are the following.

The determination of the theoretical limits on the speed of arithmetic and logical computation. Early work in this area was done by Winograd. An excellent article on the subject is Spira [18].

Floating-point arithmetic is almost universally employed but is at best an inaccurate computational procedure. Problems associated with the use of this arithmetic as implemented in various computing machines have motivated analyses of floating-point arithmetic and studies of how this arithmetic should be implemented.

Nonstandard number systems continue to be studied. Such studies are in general motivated by the need for computational speed, error correction, or for the prevention of round-off error. The need for computational speed has also motivated considerable research in the hardware implementation of algorithms for the generation of the elementary functions. Also the quest for speed has led to the development of the concept of pipelining, and this
